Measurements were obtained on code generated for the SPARC architecture by the vpo optimizing compiler [3]. The machine-dependent information contained the pipeline characteristics of the MicroSPARC I processor [15]. A direct-mapped data cache containing 16 lines of 32 bytes for a total of 512 bytes was used. The MicroSPARC I uses write-through/no-allocate data caching [15]. While the static simulator was able to categorize store data references, these categorizations were ignored by the timing analyzer since stores always accessed memory and a hit or miss associated with a store data reference had the same effect on performance. Instruction fetches were assumed to be all hits in order to isolate the effects of data caching from instruction caching.
Table 2 shows the test programs used to assess the timing analyzer's effectiveness of bounding worst-case data cache performance. Note that these programs were restricted to specific classes of data references that did not include any dynamic allocation from the heap. Two versions were used for each of the first five test programs. The a version had the same size arrays that were used in previous studies [2, 6]. The b version of each program used smaller arrays that would totally fit into a 512 byte cache. The number of bytes reported in the table is the total number of bytes of the variables in the program. Note that some of these bytes will be in the static data area while others will be in the run-time stack. The amount of data is not changed for the program Des since its encryption algorithm is based on using large static arrays with preinitialized values.
Table 2: Dynamic Results for Data Caching
Table 2 also depicts the dynamic results from executing the test programs. The hit ratios were obtained from the data cache execution simulation. Only Sort had very high data cache hit ratios due to many repeated references to the same array elements. The observed cycles were obtained using an execution simulator, modified from [5], to simulate data cache and pipeline affects and count the number of cycles. The estimated cycles were obtained from the timing analyzer discussed in Section 3.4. The estimated ratio is the quotient of these two values. The naive ratio was calculated by assuming all data cache references to be misses and dividing those cycles by the observed cycles.
The timing analyzer was able to tightly predict the worst-case number of cycles required for pipelining and data caching for most of the test programs. In fact, for five of them, the prediction was exact or over by less that one-tenth of one percent. The inner loop in the function within Sort that sorted the values had a varying number of iterations that depends upon a counter of an outer loop. The number of iterations performed was overrepresented on average by about a factor of two for this inner loop. The strategy of treating a calculated reference miss as a hit in the pipeline and adding the maximum number of cycles associated with the miss penalty to the total time of the path caused overestimations with the Statsa and Statsb programs, which were the only floating-point intensive programs in the test set. Often delays due to long-running floating-point operations could have been overlapped with data cache miss penalty cycles. Matmula had an overestimation of about 10% whereas the smaller data version Matmulb was exact. The Matmul program has repeated references to the same elements of three different arrays. These references would miss the first time they were encountered, but would be in cache for the smaller Matmulb when they were accessed again since the arrays fit entirely in cache. When all the arrays fit into cache there is no interference between them. However, when they do not fit into cache the static simulator conservatively assumes that any possible interference must result in a cache miss. Therefore, the categorizations are more conservative and the overestimation is larger. Finally, the Des program has several references where an element of a statically initialized array is used as an index into another array. There is no simple method to determine which value from it will be used as the index. Therefore, we must assume that any element of the array may be accessed any time the data reference occurs in the program. This forces all conflicting lines to be deleted from the cache state and the resulting categorizations to be more conservative. The Des program also has overestimations due to data dependencies. A longer path deemed feasible by the timing analyzer could not be taken in a function due to the value of a variable. Despite the relatively small overestimations detailed above, the results show that with certain restrictions it is possible to tightly predict much of the data caching behavior of many programs.
The difference between the naive and estimated ratios shows the benefits for performing data cache analysis when predicting worst-case execution times. The benefit of worst-case performance from data caching is not as significant as the benefit obtained from instruction caching [2, 6]. An instruction fetch occurs for each instruction executed. The performance benefit from a write-through/no-allocate data cache only occurs when the data reference from a load instruction is determined by the timing analyzer to be in cache. Load instructions only comprised on average 14.28% of the total executed instructions for these test programs. However, the results do show that performing data cache analysis for predicting worst-case execution time does still result in substantially tighter predictions. In fact, for the programs in the test set the prediction improvement averages over 30%.
The performance overhead associated with predicting WCETs for data caching using this method comes primarily from that of the static cache simulation. The time required for the static simulation increases linearly with the size of the data. However, even with large arrays as in the given test programs this time is rather small. The average time for the static simulation to produce data reference categorizations for the 11 programs given in Table 2 was only 2.89 seconds. The overhead of the timing analyzer averages to only 1.05 seconds.