next up previous
Next: Timing Analysis Up: Data Caches Previous: Calculation of Virtual Addresses

Static Simulation to Produce Data Reference Categorizations

 

 
 

Figure 4: Examples for Spacial and Temporal Locality

The method of static cache simulation is used to statically categorize the caching behavior of each data reference in a program for a specified cache configuration (see Figure 1). A program control-flow graph is constructed that includes the control flow within each function and a function instance graph, which uniquely identifies each function instance by the sequence of call sites required for its invocation. This program control-flow graph is analyzed to determine the possible data lines that can be in the data cache at the entry and exit of each basic block within the program [13].

The iterative algorithm used for static instruction cache simulation [2, 13] is not sufficient for static data cache simulation. The problem is that the calculated references can access a range of possible addresses. At the point that the data access occurs, the data lines associated with these addresses may or may not be brought in cache, depending upon how many iterations of the loop has been performed at that point. To deal with this problem a new state was created to indicate whether or not a particular data line could potentially be in the data cache due to calculated references. When an immediate predecessor block is in a different loop (the transition from the predecessor block to the current block exits a loop), the data lines associated with calculated references in that loop that are guaranteed to still be in cache are unioned into the input cache state of that block. The iterative algorithm in Figure 3 is used to calculate the input and output cache states for each basic block in the program control flow.

Once these cache state vectors have been produced, they are used to determine whether or not each of the memory references within the bounded virtual address range associated with a data reference will be in cache. The static cache simulator needs to produce a categorization of each data reference in the program. The four worst-case categories of caching behavior used in the past for static instruction cache simulation were as follows. (1) Always Miss (m): The reference is not guaranteed to be in cache. (2) Always Hit (h): The reference is guaranteed to always be in cache. (3) First Miss (fm): The reference is not guaranteed to be in cache the first time it is accessed each time the loop is entered, but is guaranteed thereafter. (4) First Hit (fh): The reference is guaranteed to be in cache the first time it is accessed each time the loop is entered, but is not guaranteed thereafter. These categorizations are still used for scalar data references.

To obtain the most accuracy, a worst-case categorization of a calculated data reference for each iteration of a loop could be determined. For example, some categorizations for a data reference in a loop with 20 iterations might be as follows:

m h h h m h h h m h h h m h h h m h h h

With such detailed information the timing analyzer could then accurately determine the worst-case path on each iteration of the loop. However, consider a loop with 100,000 iterations. Such an approach would be very inefficient in space (storing all of the categorizations) and time (analyzing each loop iteration separately). The authors decided to use a new categorization called Calculated (c) that would also indicate the maximum number of data cache misses that could occur at each loop level in which the data reference is nested. The previous data reference categorization string would be represented as follows (since there is only one loop level involved):  c 5

The order of access and the cache state vectors are used to detect cache hits within calculated references due to spatial locality. Consider the two code segments in Figure 4(a) that sum the elements of a two dimensional array. The two code segments are equivalent, except that the left code segment accesses the array in row order and the right code segment uses column order (i.e., the for statements are reversed). Assume that the scalar variables (i, j, sum, and same) are allocated to registers. Also, assume the size of the direct-mapped data cache is 256 bytes with 16 cache lines containing 16 bytes each. Thus, a single row of the array a requiring 400 bytes cannot fit into cache. The static cache simulator was able to detect that the load of the array element in the left code segment had at most one miss for each of the array elements that are part of the same data line. This was accomplished by inspecting the order in which the array was accessed and detecting that no conflicting lines were accessed in these loops. The categorizations for the load data reference in the two segments are given in the same figure. Note in this case that the array happens to be aligned on a line boundary. The specification of a single categorization for a calculated reference is accomplished in two steps for each loop level after the cache states are calculated. First, the number of references (iterations) performed in the loop is retrieved. Next, the maximum number of misses that could occur for this reference in the loop is determined. For instance, at most 25 misses will occur in the innermost loop for the left code segment. The static cache simulator determined that all of the loads for the right code segment would result in cache misses. Its data caching behavior can simply be viewed as an always miss. Thus, the range of 10,000 different addresses referenced by the load are collapsed into a single categorization of c 25 2500 (calculated reference with 25 misses at the innermost level and 2500 misses at the outer level) for the left code segment and an m (always miss) for the right code segment.

Likewise, cache hits from calculated references due to temporal locality both across and within loops are also detected. Consider the code segment in Figure 4(b). Assume a cache configuration with 32 16-byte lines (512 byte cache) so that both arrays a and b requiring 400 bytes total (200 each) fit into cache. Also assume the scalar variables are allocated to registers. The accesses to the elements of array a after the first loop were categorized as an h (always hit) by the static simulator since all of the data lines associated with array a will be in the cache state once the first loop is exited. This shows the detection of temporal locality across loops. After the first complete execution of the inner loop, all the elements of b will be in cache, so then all references to it on the remaining executions of the inner loop are also categorized as hits. Thus, the categorization of c 13 13 is given. Relative to the innermost loop, 13 misses are due to bringing b into cache during the first complete execution of the inner loop. There are also only 13 misses relative to the outermost loop since b will be completely in cache on each iteration after the first. Thus, temporal locality is also detected within loops.

The current implementation of the static data cache simulator (and timing analyzer) imposes some restrictions. First, only direct-mapped data caches are supported. Obtaining categorizations for set-associative data caches can be accomplished in a manner similar to that for instruction caches described in the next section. Second, recursive calls are not allowed since it would complicate the generation of unique function instances. Third, indirect calls are not allowed since an explicit call graph must be generated statically.


next up previous
Next: Timing Analysis Up: Data Caches Previous: Calculation of Virtual Addresses

Robert Palmer
Mon May 19 10:18:36 EDT 1997