Linear-time Modeling of Program Working Set in Shared Cache

Xiaoya Xiang, Bin Bao, Chen Ding, Yaoqing Gao
Computer Science Department, University of Rochester
IBM Toronto Software Lab
{xiang,bao,cding}@cs.rochester.edu, ygao@ca.ibm.com

Abstract

Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all $O(n^2)$ windows in an $n$-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

Keywords: Footprint, Cache sharing

I. INTRODUCTION

During a program execution, its working set can be defined as the footprint, which is the volume of data accessed in an execution window. Since the footprint shows the active data usage, it has been used to model resource sharing among concurrent tasks and to improve throughput and enforce fairness, either in memory sharing among multiprogrammed workloads or more recently in cache sharing among multicore workloads.

A trace of $n$ data accesses has $\binom{n}{2} = \frac{n(n-1)}{2}$ distinct windows and therefore $\frac{n(n-1)}{2}$ footprints. Early studies measured program footprints in the shared cache of time-sharing systems. Since applications interact between time quanta, it is sufficient to consider just the windows of a single length, the length of a scheduling quantum [20], [22].
On today's multicore systems, however, programs interact continuously. A number of techniques were developed to estimate the footprint in all-length windows, but they did not guarantee the precision of the estimation [2], [3], [6], [8], [9], [2]. A recently published technique called all-window footprint analysis can measure all footprints in $O(CK \log M)$ time, where $CK$ is linear in the length of the trace and $M$ is the volume of data accessed in the trace [23]. For each window length, the analysis shows the maximum size, the minimum size, and the size distribution of footprints in all windows of this length [23]. The analysis is not fully accurate but guarantees a relative precision, e.g. 99%. We call the analysis all-footprint analysis, because it measures the size of every footprint.

In this paper, we present average-footprint analysis. For each window length, the analysis shows the average size of footprints in all windows of this length. While the analysis gives the accurate average, it does not measure the range or the distribution. However, a weaker analysis can often be done faster. Indeed, we show that the average footprint can be measured accurately in linear time $O(n)$ for a trace of length $n$, regardless of the data size.

The average footprint is a function mapping from the length of an execution window to the volume of its data access. Intuitively, the working set increases in larger execution windows. We prove that the average footprint is monotonically nondecreasing. The new analysis precisely quantifies the growth of the average footprint over time.

The previous all-footprint analysis was the key metric used in the composable models of cache sharing [3], [6], [2], [23]. For $P$ programs, there are $2^P$ co-run combinations. A composable model makes $2^P$ predictions using $P$ single-program runs rather than $2^P$ parallel runs.
As an alternative to all-footprint analysis, the new average-footprint analysis can be used in the composable model to reduce the (footprint) measurement cost asymptotically.

To evaluate the speed and usefulness of the average-footprint analysis, we test it on the complete suites of SPEC 2000 and SPEC 2006 CPU benchmarks and compare the results with the fastest all-footprint analysis [23]. To measure the accuracy of cache sharing prediction, we rank the slowdowns in two- and three-program co-runs on a quad-core machine and compare the predicted ranking with exhaustive testing. Through experiments, we show that the average-footprint analysis can predict the effect of cache interference as accurately as the all-footprint analysis, yet at only a fraction of its cost. In fact, the cost of all-footprint analysis is too high for it to model SPEC 2006 benchmarks, which have up to.9 trillion accesses to up to GB data. In comparison, the average-footprint analysis can model all SPEC 2006 benchmarks, finishing most of the programs within a few hours.

This study has two limitations. First, we are concerned with parallel workloads consisting of only sequential programs that do not share data. We do not consider parallel programs, although similar footprint metrics have been studied to model multi-threaded workloads [6], [7]. Second, the footprint results are input specific, so they are useful mostly in workload characterization, for example, finding the most and the least interference among a set of benchmark programs.

II. BACKGROUND ON OFF-LINE CACHE MODELS

Off-line cache models do not improve performance directly but can be used to understand the causes of interference and to predict its effect before running the programs (so they may be grouped to reduce interference). Off-line analysis measures the effect of all data accesses, not just cache misses. It characterizes a single program unperturbed by other programs and by the analysis itself. Such clean-room metrics avoid the chicken-egg problem that arises when programs are analyzed together: the interference depends on the miss rate of co-running programs, but their miss rate in turn depends on the interference. Next we describe first the locality model of private cache and then the model for shared cache.

a) Reuse windows and the locality model: For each memory access, the temporal locality is determined by its reuse window, which includes all data accesses between this and the previous access to the same datum. Specifically, whether the access is a cache (capacity) miss depends on the reuse distance, the number of distinct data elements accessed in the reuse window. The relation between reuse distance and the miss rate has been well established [9], [6]. The capacity miss rate can be defined by a probability function involving the reuse distance and the cache size. Let the test program be A.

$$P(\text{capacity miss by } A \text{ alone}) = P(A\text{'s reuse distance} \ge \text{cache size})$$

b) Footprint windows and the cache sharing model: Off-line cache sharing models were pioneered by Chandra et al. [3] and Suh et al. [2] for a group of independent programs and extended for multi-threaded code by Ding and Chilimbi [6], Schuff et al. [7], and Jiang et al. [2]. Let A and B be two programs that share the same cache but do not share data. The effect of B on the locality of A is

$$P(\text{capacity miss by } A \text{ when co-running with } B) = P((A\text{'s reuse distance} + B\text{'s footprint}) \ge \text{cache size})$$

Given an execution window in a sequential trace, the footprint is the number of distinct elements accessed in the window. The examples in Figure 1 illustrate the interaction between locality and footprint. In the first example, a reuse window in program A concurs with a time window in program B. The reuse distance of A is lengthened by the footprint of B's window. The second example uses two pairings of three traces to show that the shared cache miss rate depends also on the footprint, not just the miss rate of co-run threads.

An implication of the cache sharing model is that cache interference is asymmetric for programs with different locality and footprints. A program with large footprints and short reuse distances may disproportionally slow down other programs while experiencing little or no slowdown itself. This was observed in experiments [23], [24]. In one program pair, the first program shows a near 85% slowdown while the other program shows only a 5% slowdown.

Fig. 1. Example illustrations of cache sharing. (a) In shared cache, the reuse distance in program A is lengthened by the footprint of program B (in the illustrated traces, rd = 5 for A alone, ft = 4 for B, and rd + ft = 9 for the co-run). (b) Programs B and B2 have the same miss rate. However, A and B incur 50% more misses in shared cache than A and B2 on a 3-element fully associative LRU cache. The difference is caused not by data reuse but by data footprint.

III. THE MEASUREMENT OF AVERAGE FOOTPRINT

A. Definitions

Let $W$ be the set of $\binom{n}{2}$ windows of a length-$n$ trace. Each window $w = \langle t, v \rangle$ has a length $t$ and a footprint $v$. Let $I(p)$ be a boolean function returning 1 when $p$ is true and 0 otherwise. The footprint function $fp(t)$ averages over all windows of length $t$:

$$fp(t) = \frac{\sum_{w_i \in W} v_i \, I(t_i = t)}{\sum_{w_i \in W} I(t_i = t)} = \frac{\sum_{w_i \in W} v_i \, I(t_i = t)}{n - t + 1}$$

For example, the trace abbb has 3 windows of length 2: ab, bb, and bb. The corresponding footprints are 2, 1, and 1, so $fp(2) = (2 + 1 + 1)/3 = 4/3$.

B. O(n) Algorithm

There is a linear-time algorithm that calculates the precise average footprint for all execution windows of a trace. Let $n$ and $m$ be the length of the trace and the number of distinct data used in the trace. The algorithm first measures the following three quantities:

- the distribution of the time distances of all data reuses ($n - m$ distances)
- the first-access times of all distinct data ($m$ access times)
- the last-access times of all distinct data (exact definition later, $m$ access times)

The three quantities can be measured by a single pass over the trace using a hash table with one entry for each distinct datum. The cost is linear, $O(n)$ in time and $O(m)$ in space.
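As a rough illustration of the single pass described above, the following Python sketch collects the three quantities with one hash-table entry per datum and then evaluates the average footprint from them alone. The closed form it uses is the one this section goes on to derive as Equation 7. The function names (`profile`, `average_footprint`, `brute_force_average`) and the 1-based access times are assumptions of this sketch, not the authors' implementation, and the brute-force routine follows the definition in Section III-A directly and is meant only as a cross-check on small traces.

```python
from collections import Counter


def profile(trace):
    """One pass over the trace; returns (n, first, rev_last, reuse)."""
    n = len(trace)
    first = {}          # datum -> time of its first access (1-based)
    last = {}           # datum -> time of its most recent access so far
    reuse = Counter()   # reuse time distance t -> number of reuses r_t
    for pos, datum in enumerate(trace, start=1):
        if datum in last:
            reuse[pos - last[datum]] += 1
        else:
            first[datum] = pos
        last[datum] = pos
    # Reverse last-access time: l = n + 1 - x for a last access at position x.
    rev_last = {datum: n + 1 - pos for datum, pos in last.items()}
    return n, first, rev_last, reuse


def average_footprint(n, first, rev_last, reuse, w):
    """Average footprint over all windows of length w, from the profile only."""
    m = len(first)
    # Count (window, datum) pairs in which the datum is absent from the window.
    absent = sum(f - w for f in first.values() if f > w)
    absent += sum(l - w for l in rev_last.values() if l > w)
    absent += sum((t - w) * r for t, r in reuse.items() if t > w)
    return m - absent / (n - w + 1)


def brute_force_average(trace, w):
    """Reference computation straight from the definition (quadratic cost)."""
    sizes = [len(set(trace[i:i + w])) for i in range(len(trace) - w + 1)]
    return sum(sizes) / len(sizes)


if __name__ == "__main__":
    stats = profile("abbb")
    # Both values should agree with fp(2) = 4/3 from the abbb example.
    print(average_footprint(*stats, 2))
    print(brute_force_average("abbb", 2))
```

The profile is built once, after which `average_footprint` can be called for any window length without touching the trace again, which is the property the text relies on.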

The three measures are the inputs to a formula $fp(w)$. For any window size $w$ ($0 < w \le n$), $fp(w)$ computes the average footprint for all windows of size $w$. In other words, the formula computes the average footprint for windows of all sizes without having to inspect the trace again. In the rest of the section, we derive the formula and discuss its complexity.

The main idea of the formula is differential counting, which counts the difference in the footprint between consecutive windows. For any window size $w$, we start with the footprint in the first window and then compute its increase or decrease as the window moves forward in the trace. The first-access times are sufficient to compute the footprint of the first window. The change in later windows depends on two metrics of each trace element $d_i$: the forward time distance $fwd(d_i)$ and the backward time distance $bwd(d_i)$. Let datum $x$ be accessed at $d_i$. Let the closest accesses of $x$ be $d_j$ before $d_i$ and $d_k$ after $d_i$. Then $fwd(d_i) = k - i$ and $bwd(d_i) = i - j$. The forward and backward time distances determine the change of footprint between consecutive windows. The relation is shown in Figure 2.

Fig. 2. An illustration of how the forward and backward (reuse) time distances influence the change in footprint between consecutive windows.

Let the footprint of a $w$-size window starting at $i$ be $fp(i)$. Each element $d_i$ in the trace affects the footprint of $w$ windows: $fp(i-w+1), fp(i-w+2), \ldots, fp(i)$. In differential counting, we consider only the effect of $d_i$ on two pairs of windows: the change from $fp(i-w)$ to $fp(i-w+1)$ when $d_i$ enters into its first window and the change from $fp(i)$ to $fp(i+1)$ when $d_i$ exits from its last window of influence. Figure 2 shows $d_i$ and the two pairs of windows, where $d_i$ enters between the first pair and exits between the second pair.
When $d_i$ enters, it does not increase the footprint $fp(i-w)$ if the same datum was previously accessed within $fp(i-w+1)$, which means that its backward time distance is no greater than $w$ ($bwd(d_i) \le w$). This is the case illustrated in Figure 2. Otherwise, $d_i$ adds 1 to the footprint $fp(i-w)$. Similarly, when $d_i$ exits from $fp(i)$, the departure does not change $fp(i+1)$ if $fwd(d_i) \le w$; otherwise, it subtracts 1 from $fp(i+1)$, as in the case illustrated in Figure 2. The footprint $fp(i+1)$ depends on three factors: the footprint $fp(i)$, the contribution of the entering $d_{i+w}$, and the detraction of the exiting $d_i$.

The footprint of all windows is then computed by adding these differences. Next we formulate this computation. We use the following notations.

- $n$, $m$, $w$: the length of the trace, the size of data, and the window size of interest
- $d_i$: the $i$-th trace access
- $fp(i)$: the footprint of the window from $d_i$ to $d_{i+w-1}$ (including $d_i$ and $d_{i+w-1}$)
- $bwd(d_i)$: the backward reuse time distance of $d_i$, $\infty$ if $d_i$ is the first access
- $fwd(d_i)$: the forward reuse time distance of $d_i$, $\infty$ if $d_i$ is the last access
- $I(p)$: a boolean function that returns 1 if $p$ is true and 0 otherwise

For example, $I(bwd(d_i) > w)$ gives the contribution by $d_i$, which is 1 if $bwd(d_i) > w$ and 0 otherwise. Similarly, $I(fwd(d_i) > w)$ gives the detraction of $d_i$, which is 1 if $fwd(d_i) > w$ and 0 otherwise.

The total size of the footprints in all windows of length $w$, when divided by the number of windows $n - w + 1$, is the average footprint, as shown next in Equation 1.

$$fp(w) = \frac{1}{n-w+1} \sum_{i=1}^{n-w+1} fp(i) \qquad (1)$$

Since

$$fp(i+1) = fp(i) + I(bwd(d_{i+w}) > w) - I(fwd(d_i) > w) \qquad (2)$$

expanding Equation 1 using Equation 2, we have three components in the average footprint:

$$fp(w) = fp(1) + \frac{1}{n-w+1} \left( \sum_{i=w+1}^{n} (n-i+1) \, I(bwd(d_i) > w) - \sum_{i=1}^{n-w} (n-i+1-w) \, I(fwd(d_i) > w) \right) \qquad (3)$$

Next we compute each component separately.
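The recurrence of Equation 2 can be tried out on a small trace before its terms are simplified. The Python sketch below is an illustration under assumed names (`time_distances`, `footprints_by_differencing`) and 0-based list indices, not the authors' code. It computes the forward and backward time distances in one pass, seeds the first window by direct counting, and then slides the window using only the two indicator terms. First and last accesses are treated as having an infinite distance.

```python
import math


def time_distances(trace):
    """Return (fwd, bwd): per-access forward and backward time distances."""
    n = len(trace)
    fwd = [math.inf] * n   # infinite when there is no later access
    bwd = [math.inf] * n   # infinite when there is no earlier access
    last = {}              # datum -> index of its most recent access
    for i, datum in enumerate(trace):
        if datum in last:
            j = last[datum]
            bwd[i] = i - j
            fwd[j] = i - j
        last[datum] = i
    return fwd, bwd


def footprints_by_differencing(trace, w):
    """Footprint of every length-w window, obtained by differential counting."""
    fwd, bwd = time_distances(trace)
    fp = [len(set(trace[:w]))]            # first window, counted directly
    for i in range(len(trace) - w):
        entering = 1 if bwd[i + w] > w else 0   # new datum for the next window
        exiting = 1 if fwd[i] > w else 0        # datum leaves with access i
        fp.append(fp[-1] + entering - exiting)
    return fp


if __name__ == "__main__":
    sizes = footprints_by_differencing("abbb", 2)
    print(sizes, sum(sizes) / len(sizes))
```

When the two distances both equal $w$, the entering and exiting accesses touch the same datum and the two indicators cancel, which is why a strict comparison is sufficient in both terms.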
The footprint of the first window of length $w$ is

$$fp(1) = \sum_{i=1}^{w} I(bwd(d_i) = \infty) \qquad (4)$$

In the next component, we split the backward time distances into two groups: finite and infinite distances. The summation of the finite distances can be taken from 1 to $n$ instead of from $w+1$ to $n$, because an access among the first $w$ positions cannot have a finite backward distance greater than $w$.

$$\begin{aligned}
\sum_{i=w+1}^{n} (n-i+1) \, I(bwd(d_i) > w)
&= \sum_{i=w+1}^{n} (n-i+1) \, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1) \, I(bwd(d_i) = \infty) \\
&= \sum_{i=1}^{n} (n-i+1) \, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1) \, I(bwd(d_i) = \infty)
\end{aligned} \qquad (5)$$

Similarly, we decompose and simplify the forward distances:

$$\begin{aligned}
\sum_{i=1}^{n-w} (n-i+1-w) \, I(fwd(d_i) > w)
&= \sum_{i=1}^{n-w} (n-i+1-w) \, I(w < fwd(d_i) < \infty) + \sum_{i=1}^{n-w} (n-i+1-w) \, I(fwd(d_i) = \infty) \\
&= \sum_{i=1}^{n} (n-i+1-w) \, I(w < fwd(d_i) < \infty) + \sum_{i=1}^{n-w} (n-i+1-w) \, I(fwd(d_i) = \infty)
\end{aligned} \qquad (6)$$

Combining Equations 4, 5, and 6, we can now expand Equation 3. Instead of using individual accesses, we now use the three inputs, defined as follows:

- $f_i$: the first access time of the $i$-th datum
- $l_i$: the reverse last access time of the $i$-th datum. If the last access is at position $x$, $l_i = n + 1 - x$, that is, the first access time in the reverse trace.
- $r_t$: the number of accesses with a reuse time distance $t$

$$\begin{aligned}
fp(w) &= \sum_{i=1}^{n} I(bwd(d_i) = \infty) + \frac{1}{n-w+1} \Bigg( \sum_{i=w+1}^{n} (w-i) \, I(bwd(d_i) = \infty) - \sum_{i=1}^{n-w} (n-i+1-w) \, I(fwd(d_i) = \infty) \\
&\qquad + \sum_{i=1}^{n} (n-i+1) \, I(w < bwd(d_i) < \infty) - \sum_{i=1}^{n} (n-i+1-w) \, I(w < fwd(d_i) < \infty) \Bigg) \\
&= m + \frac{1}{n-w+1} \left( \sum_{i=1}^{m} (w - f_i) \, I(f_i > w) + \sum_{i=1}^{m} (w - l_i) \, I(l_i > w) + \sum_{t=1}^{n-1} (w - t) \, I(t > w) \, r_t \right) \\
&= m - \frac{1}{n-w+1} \left( \sum_{i=1}^{m} (f_i - w) \, I(f_i > w) + \sum_{i=1}^{m} (l_i - w) \, I(l_i > w) + \sum_{t=w+1}^{n-1} (t - w) \, r_t \right)
\end{aligned} \qquad (7)$$

The formula of Equation 7 passes the sanity check that the average footprint $fp(w)$ is at most the data size $m$, and the footprint of the whole trace ($w = n$) is $m$. Fixing the window length $w$ and ignoring the effect of first and last accesses, we see that the footprint decreases if more reuse time distances ($r_t$) have larger values ($t$). This suggests that improving locality reduces the average footprint. For example, if we double the length of a trace by repeating each element twice, the length of the long time distances would double, and the average footprint would drop.

For each window length $w$, Equation 7 can be computed in time $O(w)$. If we consider only window sizes on a logarithmic scale, the formula can be represented and evaluated in $O(\log w)$ time.

C. Monotonicity

Theorem 3.1: The average footprint $fp(w)$ is nondecreasing.

Proof: Let $w_i^k$ denote the $i$-th window whose size is $k$, and let $f(w_i^k)$ denote the footprint of that window. We prove that for all $k$, $0 < k < n$, $fp(k+1) \ge fp(k)$.
First, for all $i$, $0 < i \le n-k$, the following holds because $w_i^k$ and $w_{i+1}^k$ are both contained in $w_i^{k+1}$:

$$f(w_i^{k+1}) \ge f(w_i^k), \qquad f(w_i^{k+1}) \ge f(w_{i+1}^k)$$

In addition, for every $k$, $0 < k \le n$, there exists $j$, $0 < j \le n-k+1$, such that $f(w_j^k) \le fp(k)$, because some window is no larger than the average. Then

$$\begin{aligned}
fp(k+1) &= \frac{1}{n-k} \sum_{i=1}^{n-k} f(w_i^{k+1}) \\
&= \frac{1}{n-k} \left[ \sum_{i=1}^{j-1} f(w_i^{k+1}) + \sum_{i=j}^{n-k} f(w_i^{k+1}) \right] \\
&\ge \frac{1}{n-k} \left[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j}^{n-k} f(w_{i+1}^{k}) \right] \\
&= \frac{1}{n-k} \left[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j+1}^{n-k+1} f(w_i^{k}) \right] \\
&= \frac{1}{n-k} \left[ \sum_{i=1}^{n-k+1} f(w_i^{k}) - f(w_j^k) \right] \\
&= \frac{1}{n-k} \left[ (n-k+1) \, fp(k) - f(w_j^k) \right] \\
&= fp(k) + \frac{1}{n-k} \left[ fp(k) - f(w_j^k) \right] \\
&\ge fp(k)
\end{aligned}$$

IV. AVERAGE FOOTPRINT IN THE COMPOSABLE MODEL

Our previous work used all-footprint analysis in the composable model to predict cache interference [23]. In the composable model, when multiple programs are run together, each reuse distance in a program is lengthened by the aggregate footprint of all peer programs over the same time window. Suppose there are $n$ programs $t_1, t_2, \ldots, t_n$ running on a shared cache. The miss rate is computed by

$$P(\text{capacity miss by } t_i \text{ running with } t_j,\ j = 1, \ldots, n,\ j \ne i) = P\Big(\big(t_i\text{'s reuse distance} + \sum_{j \ne i} t_j\text{'s footprint}\big) \ge \text{cache size}\Big)$$

Suppose the distribution of program $t_i$'s reuse distance is $D_{rd}(t_i)$, and the distribution of program $t_i$'s footprint of window size $w$ is $D_{fp}(t_i, w)$. The first distribution is defined as

$$D_{rd}(t_i) = \Big\{ \langle x_{k_i}, p_{k_i} \rangle \ \Big|\ \sum_{k_i} p_{k_i} = 1 \Big\}$$

where $\langle x_{k_i}, p_{k_i} \rangle$ means that the probability that the reuse distance equals $x_{k_i}$ is $p_{k_i}$. Similarly, we define

$$D_{fp}(t_i, w) = \Big\{ \langle y^w_{k_i}, q^w_{k_i} \rangle \ \Big|\ \sum_{k_i} q^w_{k_i} = 1 \Big\}$$

Given a window size $w$, we use $\langle y^w_{k_i}, q^w_{k_i} \rangle$ to mean that the probability that the footprint equals $y^w_{k_i}$ is $q^w_{k_i}$.

Consider a 2-program co-run involving $t_1$ and $t_2$. The capacity miss rate of $t_1$ is calculated by Equation 8:

$$mr(t_1) = \sum_{k_1} \sum_{k_2} p_{k_1} \, q^{w(x_{k_1})}_{k_2} \, I\big(x_{k_1} + y^{w(x_{k_1})}_{k_2} \ge C\big) \qquad (8)$$

where $I$ is the boolean indicator function defined in Section III, $C$ is the cache size, and $w(x_k)$ is the size of the reuse window that contains the reuse distance $x_k$. This is the equation employed by all-footprint based modeling [23]. To use average-footprint analysis instead, we define the average footprint of a window size $w$ for program $t_i$ as $F(t_i, w) = f_i^w$. Equation 8 can be simplified to Equation 9:

$$mr(t_1) = \sum_{k_1} p_{k_1} \, I\big(x_{k_1} + f_2^{w(x_{k_1})} \ge C\big) \qquad (9)$$

The estimation of the execution time from the miss rate is the same as in [23]. The only difference is that the previous model uses all-footprint analysis and Equation 8, and the new model uses average-footprint analysis and the simpler Equation 9.

V. EVALUATION

A. Experimental Setup

We have implemented the average-footprint analysis algorithm in a profiling tool and tested 26 SPEC2K benchmarks, 12 integer and 14 floating-point, and 29 SPEC2006 benchmarks, 12 integer and 17 floating-point. All benchmarks are instrumented by Pin [5] and profiled on a machine with an Intel Core i5-660 processor and 4GB physical memory. The machine is set up with Fedora 3 and GCC 4.4.5. The two-program co-run results for SPEC 2000 are collected on an Intel Core 2 Duo machine with two 2.0GHz cores sharing 2MB L2 cache and 2GB memory. In order to measure 3-program co-runs, we use an Intel quad-core machine, with four 2.27GHz cores sharing 8MB L3 cache and 8GB memory. Except in Section V-G, where we examine the effect of input, we use the reference input in the test.
Some programs, especially in SPEC 2006, have multiple reference inputs. We use the first one tested by the auto-runner. In performance comparisons, the base program run time is the one without Pin instrumentation or any other analysis.

The length of SPEC 2000 traces ranges from 4 billion in gcc to 425 billion in mgrid. The amount of data ranges from 3 thousand 64-byte cache blocks (MB) in eon to 3.2 million cache blocks (256MB) in gcc. The SPEC 2006 traces on average are 0 times as long as SPEC 2000 traces and have 5 times as many cache blocks. The trace bwaves is the longest with.9 trillion data accesses and has the most data, 928MB. The individual statistics of the 55 programs are listed in Table II.

To evaluate cache-sharing predictions, we run two experiments:

- 2-program co-runs. We predict all 2-program co-runs of the 15 SPEC 2000 benchmarks used in the previous work [23] and compare the predicted ranking with that of the previous work.
- 3-program co-runs. We started with the 10 representative benchmarks in SPEC2006 as selected by Zhuravlev et al. [29]. Reuse-distance analysis was too slow to measure 2 of the programs. We evaluate the prediction for all program triples of the remaining 8 programs.

In both tests, we also compare with a simple prediction method based on miss rates (by ranking the total miss rate of the programs in the co-run group) [23].

B. Efficiency of Average-footprint Analysis

Table I summarizes the analysis cost for the two benchmark suites, and for each suite, the average for integer and for floating-point programs. It divides the 55 tests into four groups: 12 SPEC 2000 integer programs, 14 SPEC 2000 floating-point programs, 12 SPEC 2006 integer programs, and 17 SPEC 2006 floating-point programs. The result of each group is summarized in three rows and three columns. The columns show the trace length, the data size, and the slowdown ratio of the profiling time to the unmodified run time.
The rows show the minimum, maximum, and the average slowdown factors for all benchmarks of the group. The minimum slowdowns in the four benchmark groups are all below 10. The maximum slowdowns are 40, 32, 41, and 74. The average slowdowns are between 21 in SPEC 2006 integer tests and 29 in SPEC 2000 integer tests. On average across all four groups, average-footprint analysis takes no more than 30 times the original execution time.

The individual results of the 55 programs are shown in Table II. Compared to the summary table, the individual-result table has two additional columns, which show the unmodified execution time and the time of average-footprint analysis. The unmodified time measures the execution of the original program without any instrumentation or analysis. On average, an unmodified SPEC 2000 program takes less than 3 minutes, and an unmodified SPEC 2006 program takes close to 0 minutes. Average-footprint analysis takes 3 to 73 minutes for SPEC 2000 programs and 0 minutes (gcc) to 10 hours (calculix) for SPEC 2006 programs.

TABLE I
THE MIN, MAX, AND AVERAGE COSTS (SLOWDOWNS) OF AVERAGE-FOOTPRINT ANALYSIS FOR 55 SPEC 2000 AND SPEC 2006 BENCHMARK PROGRAMS

benchmarks                  stats   trace length   data size (64B lines)   avg-fp slowdown (x)
SPEC2000 INT (12 programs)  min     .4 E+0         0.3 E+5                 8.8
                            max     6.05 E+0       32.55 E+5               39.7
                            mean    7.52 E+0       .67 E+5                 28.8
SPEC2000 FP (14 programs)   min     3.03 E+0       0.56 E+5                9.7
                            max     42.55 E+0      3.28 E+5                32.4
                            mean    7.44 E+0       4.6 E+5                 2.9
SPEC2006 INT (12 programs)  min     4.88 E+0       3.0 E+5                 8.
                            max     5.47 E+0       37.36 E+5               40.9
                            mean    47.20 E+0      34.05 E+5               20.7
SPEC2006 FP (17 programs)   min     20.73 E+0      0.40 E+5                9.4
                            max     385.5 E+0      45.03 E+5               73.9
                            mean    38.85 E+0      58.69 E+5               26.

C. Comparison with All-footprint Analysis

All-footprint analysis can analyze SPEC 2000 programs but not SPEC 2006 programs. We compare average- and all-footprint analysis on SPEC 2000 programs in Table III. SPEC 2000 has 26 programs in total. The paper on all-footprint analysis reported results for 15 of the programs [23]. The table summarizes the cost of the two analyses in these 15 tests in the last two columns. The slowdowns by average-footprint analysis are between 8.8 and 40. The slowdowns by all-footprint analysis are between 248 and 360. The average slowdown is 40 for average-footprint analysis and about 500 for all-footprint analysis. In other words, on average for these 15 programs, average-footprint analysis is 38 times faster than all-footprint analysis.

All-footprint analysis takes too long for SPEC 2006 programs. For example, it takes average-footprint analysis 10 hours to profile calculix. Being 38 times slower, all-footprint analysis would take more than two weeks to measure the all-footprint distribution.

D. Two-Program Co-run Ranking

The prior work showed co-run ranking results for 15 SPEC 2000 programs based on all-footprint analysis and compared them with miss-rate based ranking and exhaustive testing [23]. We now show the ranking results using average-footprint analysis and compare them with the three previous ranking methods.

We show the prediction results in a 2-D plot. The x-axis is the rank of program co-run groups. In this test, the rank ranges from 1 (the least interfering pair) to 105 (the most interfering pair). The y-axis shows the interference, measured by the quadratic mean of the slowdowns of programs in the co-run group. The slowdown of a co-run program is the ratio of its co-run time to the time running alone on the same machine (cache). For two programs with slowdowns $s_1$ and $s_2$, we have

$$y = \sqrt{\frac{s_1^2 + s_2^2}{2}}$$

The three graphs in Figure 3 show the plots for the predictions based on miss rate, all-footprint analysis, and average-footprint analysis.
In each plot, the accurate result from exhaustive testing is shown by a monotonically increasing red line as a reference. The simple miss-rate based prediction does not show an increasing trend, suggesting no correlation between the prediction and the actual interference. The two footprint-based predictions show significant correlation. Programs predicted to have a high interference tend to actually have a high interference.

Average-footprint analysis ranks several program pairs better than all-footprint analysis. Consider the pair with the highest interference, (art, mcf), with a slowdown of 2. The pair is ranked 23 by miss rate, 99 by all-footprint analysis, and 105 by average-footprint analysis. The average-footprint rank is precisely correct. All-footprint ranking has a significant misprediction for the program pair (gcc, art). The two programs slow each other down by.6 times. The pair should be ranked 97 but is ranked 44 by all-footprint analysis, which is worse than the miss-rate rank of 70. The rank by average-footprint analysis is relatively the best at 86.

E. Three-Program Co-run Ranking

Evaluating larger group co-runs is difficult because the number of tests increases exponentially with the size of the co-run group. To test all 3-program co-runs in SPEC 2006 benchmarks, we would have to run $\binom{27}{3} = 2925$ tests. Even if we ran all the tests, it would have been impossible to show the results clearly. Fortunately, Zhuravlev et al. have analyzed the benchmark set based on the cache miss rates and access rates and identified 10 representatives [29]. We had to narrow down further because the reuse-distance analysis could finish only for 8 out of the 10 representatives: 403.gcc, 416.gamess, 429.mcf, 444.namd, 445.gobmk, 450.soplex, 453.povray, and 470.lbm. There are 56 different 3-program groups from these 8 benchmarks.

We show the prediction results in Figure 4. The results for 3-program co-runs of SPEC 2006 programs are similar to those of 2-program co-runs of SPEC 2000 programs.
As before, the miss-rate based prediction does not show a detectable correlation while the average-footprint analysis shows a clear correlation with the actual interference. The maximal slowdown increases from 2.0 in 2-program co-runs to 3.3 in 3-program co-runs, confirming the expectation that the interference becomes worse as the cache is shared by more programs. Exhaustive testing is also increasingly infeasible. For both reasons, the composable model is more valuable, and so is the higher efficiency of the average-footprint analysis.

Fig. 3. Evaluation of 2-program co-run predictions for 15 SPEC2000 benchmark programs. Three panels (miss rate, all footprint, average footprint), each plotted against exhaustive testing; x-axis: ranked program pairs (from least interference to most interference); y-axis: slowdown. The prediction quality of average-footprint analysis is similar to that of all-footprint analysis.

Fig. 4. Comparison of 3-program co-run predictions for 8 SPEC2006 benchmarks. Two panels (miss rate, average footprint), each plotted against exhaustive testing; x-axis: ranked program triples (from least interference to most interference); y-axis: slowdown. All-footprint analysis cannot model these programs because of its high cost.

F. Rank and Performance Closeness

To quantify the difference between the predicted ranking and the accurate ranking, we define two metrics: the rank closeness and the performance closeness. The rank closeness shows on average how the predicted rank of a co-run group differs from the actual rank. We number $n$ co-run groups by their accurate rank $i$. Let $pred(i)$ be the predicted rank for group $i$. The rank closeness is defined as

$$\text{rank closeness} = \frac{\sum_{i=1}^{n} |pred(i) - i|}{n}$$

The formula is the Manhattan distance between two vectors $\langle pred(1), pred(2), \ldots, pred(n) \rangle$ and $\langle 1, 2, \ldots, n \rangle$, divided by $n$. The worst possible ranking has a rank-closeness score of $n/2$ if $n$ is even or $(n-1)/2$ otherwise.

Next we quantify the error in terms of the mis-predicted slowdown. Let $f(i)$ be the slowdown of the co-run group with the accurate rank $i$, and $f(pred(i))$ be the slowdown of the co-run group with the predicted rank $i$. The difference $|f(pred(i)) - f(i)|$ gives the mis-prediction. The performance closeness is the average mis-prediction for all groups:

$$\text{performance closeness} = \frac{\sum_{i=1}^{n} |f(pred(i)) - f(i)|}{n}$$

The two metrics are shown in Table IV. On average for 2-program co-runs, the miss-rate rank errs by 35 positions, while the footprint-based ranks err by 4 and 5 positions. For 3-program co-runs, the miss-rate rank errs by 9 positions, while the average-footprint rank errs by 6.
In terms of performance, the mis-prediction of the miss-rate based ranking is twice as large as that of the footprint-based ranking.

In search of a closeness metric, we also measured the Levenshtein distance. For two permutations of a set of numbers, the Levenshtein distance measures the number of edits needed to convert one to the other. For the 2-program co-run test, the distance is 03 for miss rate, 97 for all-footprint, and 96 for average-footprint. For the 3-program co-run test, the distance is 54 for miss rate and 48 for average-footprint. Levenshtein distance is not a good metric here since it does not distinguish a ranking that shows no correlation from rankings that do.
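The two closeness metrics are simple to compute once the predicted ranks are known. The sketch below is a minimal Python illustration, assuming the absolute-difference (Manhattan) reading of both definitions. The inputs `pred` and `slowdown` are hypothetical, listed in order of the accurate rank, and the sample numbers are made up for illustration rather than taken from Table IV.

```python
def rank_closeness(pred):
    """Mean absolute difference between predicted and accurate ranks.

    pred[i - 1] is the predicted rank of the group whose accurate rank is i.
    """
    n = len(pred)
    return sum(abs(p - i) for i, p in enumerate(pred, start=1)) / n


def performance_closeness(pred, slowdown):
    """Mean absolute slowdown difference implied by the rank mis-prediction.

    slowdown[i - 1] is the measured slowdown of the group with accurate rank i.
    """
    n = len(pred)
    return sum(abs(slowdown[p - 1] - slowdown[i - 1])
               for i, p in enumerate(pred, start=1)) / n


if __name__ == "__main__":
    slowdown = [1.0, 1.5, 2.0]   # made-up slowdowns, sorted by accurate rank
    pred = [2, 1, 3]             # made-up prediction that swaps the first two
    print(rank_closeness(pred))
    print(performance_closeness(pred, slowdown))
```

A perfect prediction gives zero for both metrics, and a swap between two groups with nearly equal slowdowns costs little in performance closeness even though it counts fully in rank closeness.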

2-program co-run over 5 SPEC2000 benchmarks ranking strategy perf closeness rank closeness miss rate 0.385 35 all-footprint 0.94 5 avg-footprint 0.70 4 3-program co-run over 0 SPEC2006 benchmarks ranking strategy perf closeness rank closeness miss rate 0.632 9 avg-footprint 0.225 6 TABLE IV COMPARE DIFFERENT RANKING STRATEGIES G. The Effect of Input on Average Footprint The footprint of a program execution is affected by the program input just as the length of the execution is affected by the input. An important question for profiling-based techniques is how much the footprints in training runs may differ from those in test runs. In this section, we give a preliminary measure of this difference. Given a set of k executions of the same program, we quantify the variation between the k footprint functions (f i (w)) as follows. First, we compute the average of the average footprints: f(w) k k f i(w). Then we compute the Manhattan distance between the i-th execution and the average as: d i W j fi(wj) f(w j) ) W where W is the number of different window lengths. A Manhattan distance of x% means that on average, the input i s footprint function differs from the average by x% in each window size. Table V shows SPEC2000 and SPEC2006 programs, the number of inputs (provided by the benchmark suite and tested in our experiments), the range of trace lengths and data sizes in these inputs, the smallest and largest Manhattan distances as we just defined. The majority of programs, 20 out of 37, see no more than 30% difference between footprints of different inputs. The minimal difference is less than 20% in all but 5 programs. Note that the effect of the input may be predicted using model fitting based on the input characteristics [26]. This is outside the scope of this paper. VI. 
RELATED WORK

Locality models: Locality in private cache can be modeled by reuse distance, which can be measured with a guaranteed precision in time O(n log 2 m), where n is the length of the trace and m is the size of data [26]. Reuse distance has found many uses in workload characterization and program optimization [26]. There are a number of recent developments. Chauhan and Shei gave a method for static analysis of locality in MATLAB code [4]. Unlike profiling, whose results are usually input specific, static analysis can identify and model the effect of program parameters. Most previous models targeted program analysis. Ibrahim and Strohmaier used synthetic probes to emulate the locality of an application for efficient machine characterization [10]. Zhou studied the random cache replacement policy and gave a one-pass deterministic trace-analysis algorithm to compute the average miss rate (instead of simulating many times and taking the average) [28]. Finally, Schuff et al. defined multicore reuse distance analysis and improved its efficiency through sampling and parallelization [17]. The sampling was based on a method developed earlier by Zhong and Chang [25]. These techniques are concerned only with reuse windows and cannot measure the footprint in all execution windows, which is the problem addressed in this paper.

Off-line cache sharing models: The average working-set size in single-length execution windows, such as a scheduling quantum, can be computed in linear time. It has been used in studying multi-programmed systems [7], [22]. In a parallel environment such as today's multicore processors, programs interact constantly. The interference in all-length windows has been considered for memory [21] and for cache [3]. Both used a recursive equation involving the working set and the miss rate: as a window of size w is extended to w + 1, the change in the working set depends on whether the new access is a miss. Suh et al.
assumed linear function growth when window sizes were close [21]. Chandra et al. computed the recursive relation bottom up [3]. The same problem has been solved using statistical inference. Two techniques, by Berg and Hagersten (known as StatCache) [2] and by Shen et al. [19], were used to infer the cache miss rate from the distribution of reuse times. Berg and Hagersten assumed a constant miss rate over time and random cache replacement [2], and Shen et al. assumed a Bernoulli process and LRU cache replacement [18], [19]. The latter method was adapted to predict cache interference [12]. A precise prediction was shown to be useful in approximately solving the optimal co-scheduling problem [11]. However, none of these methods can bound the approximation error. Our earlier work gave the first precise methods for measuring the footprint [5], [23] and an iterative model for the circular effect of cache interference [23]. The linear-time algorithm in this paper computes the average rather than the full distribution and improves the measurement speed by nearly 40 times, yet maintains a similar accuracy in shared-cache locality prediction.

On-line models: The miss rate curve has been used for memory partitioning to ensure fairness or maximize throughput in a parallel workload [27]. Similarly, reuse distance has been used for cache partitioning among data objects [14]. Recently, Zhuravlev et al. reviewed four models based on the miss rate and the reuse distance [29]. As on-line models, these techniques did not consider the working set metrics because of the cost. For example, Zhuravlev et al. considered a less accurate model from Chandra et al. because, for efficiency, it did not require all-window footprints [3]. Zhuravlev et al. showed that cache sharing is one of the factors but not necessarily the major factor in interference [29]. Still, an accurate and fast solution may help to quantify the contribution of cache sharing to the overall interference.
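The recursive relation between the working set and the miss rate, described under the off-line models above, can be illustrated with a small sketch. It is not the exact formulation of the cited models: it assumes that the expected working-set size grows by the probability that the next access misses in a cache as large as the current working set. The function names and the miss-rate curve are hypothetical.

```python
def working_set_growth(miss_rate, max_window):
    """Return [E(0), E(1), ..., E(max_window)].

    E(w) approximates the expected working-set size of a window of
    w accesses, using the assumed recurrence
        E(w + 1) = E(w) + miss_rate(E(w)),
    where miss_rate(c) is the miss probability for a cache of size c.
    """
    sizes = [0.0]
    for _ in range(max_window):
        current = sizes[-1]
        # Extending the window by one access adds new data only on a miss.
        sizes.append(current + miss_rate(current))
    return sizes


def example_miss_rate(cache_size):
    # Hypothetical miss-rate curve that decreases with cache size.
    return 1.0 / (1.0 + cache_size)


sizes = working_set_growth(example_miss_rate, 3)
```

Computing the relation bottom up in this way takes one step per window length, but the result is an estimate whose error depends on how well the miss-rate curve captures the program behavior.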
Analytical models and streaming analysis: Counting the number of distinct data items has been considered as a streaming analysis problem. Space-efficient (less than O(m)) solutions exist to measure the frequency moments F_0 (footprint), F_1 (total frequency), F_2, F_∞ (most frequent item), and entropy [1], [8], [13]. Instead of counting the F_0 moment over the whole trace, we solve the problem of collecting the average F_0 for all execution windows and focus on reducing the time complexity from O(n^2) to linear. Streaming solutions may be combined to further reduce the space requirement of our algorithms.

VII. SUMMARY

Complete characterization of the footprint requires measuring data access in all execution windows. In this paper, we have presented the average footprint as a metric of the all-window footprint. The footprint function maps from time to average footprint. We have shown that the average footprint function is monotone and can be used in the composable model to rank cache interference in shared cache without having to test any parallel executions. We have presented a linear-time algorithm for accurately measuring the average footprint. The linear-time algorithm uses differential counting based on the forward and backward reuse time distance. When tested on SPEC CPU 2000 benchmarks, the average-footprint analysis is on average 38 times faster than the previous, all-footprint analysis, yet it shows comparable accuracy in shared-cache locality prediction. The average-footprint analysis was efficient enough to measure the newer SPEC 2006 benchmarks, whereas the all-footprint analysis was not.

ACKNOWLEDGMENT

We would like to thank Tongxin Bai for providing histogram mapping libraries. The presentation has been improved by the suggestions from Xipeng Shen and the systems group at the University of Rochester. Xiaoya Xiang and Bin Bao are supported by two IBM Center for Advanced Studies Fellowships. The research is also supported by the National Science Foundation (Contract No. CCF-604, CCF-0963759, CNS-0834566).

REFERENCES

[1] N. Alon, Y. Matias, and M. Szegedy.
The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computing, pages 20–29, 1996.
[2] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 169–180, 2005.
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 340–351, 2005.
[4] A. Chauhan and C.-Y. Shei. Static reuse distances for locality-based optimizations in MATLAB. In International Conference on Supercomputing, pages 295–304, 2010.
[5] C. Ding and T. Chilimbi. All-window profiling of concurrent executions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Poster paper.
[6] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi-threaded programs. Technical Report MSR-TR-2009-107, Microsoft Research, August 2009.
[7] B. Falsafi and D. A. Wood. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 7(1):104–130, 1997.
[8] P. Flajolet and G. Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science, 1983.
[9] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612–1630, 1989.
[10] K. Z. Ibrahim and E. Strohmaier. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. Proceedings of the International Conference on Parallel Processing, 0:353–362, 2010.
[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 220–229, 2008.
[12] Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction, pages 264–282, 2010.
[13] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 145–156, 2006.
[14] Q. Lu, J. Lin, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Soft-OLP: Improving hardware cache performance through software-controlled object-level partitioning. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 246–257, 2009.
[15] C.-K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, Illinois, June 2005.
[16] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM System Journal, 9(2):78–117, 1970.
[17] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 53–64, 2010.
[18] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 202–216, 2008.
[19] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55–61, 2007.
[20] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 41(9):1054–1068, 1992.
[21] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In International Conference on Supercomputing, pages 1–12, 2001.
[22] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4):305–329, 1987.
[23] X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi. All-window profiling and composable models of cache sharing. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 91–102, 2011.
[24] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multi-core cache management. In Proceedings of the EuroSys Conference, 2009.
[25] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the International Symposium on Memory Management, pages 91–100, 2008.
[26] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 31(6):1–39, Aug. 2009.
[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 177–188, 2004.
[28] S. Zhou. An efficient simulation algorithm for cache of random replacement policy. In Proceedings of the IFIP International Conference on Network and Parallel Computing, pages 144–154, 2010. Springer Lecture Notes in Computer Science No. 6289.
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 129–142, 2010.

| benchmark | trace length | data size (64-byte lines) | unmodified time (sec) | avg-fp analysis time | FP alg cost (x) |
|---|---|---|---|---|---|
| 164.gzip | 3.93 E+0 | 4.07 E+5 | 22.7 | 460.0 | 20.3 |
| 175.vpr | 6.26 E+0 | 0.38 E+5 | 33.7 | 809.0 | 24.0 |
| 176.gcc | .4 E+0 | 32.55 E+5 | 4.7 | 87.0 | 39.7 |
| 181.mcf | 2.29 E+0 | 2.62 E+5 | 46.6 | 408.0 | 8.8 |
| 186.crafty | .36 E+0 | 0.32 E+5 | 37.2 | 427.0 | 38.4 |
| 197.parser | 6.05 E+0 | 4. E+5 | 80.2 | 926.0 | 24.0 |
| 252.eon | 7.8 E+0 | 0.3 E+5 | 20. | 786.0 | 39.2 |
| 253.perlbmk | 4.69 E+0 | 8.4 E+5 | 4.9 | 542.0 | 36.3 |
| 254.gap | .6 E+0 | 3.50 E+5 | 39.4 | 26.0 | 32.0 |
| 255.vortex | 6.45 E+0 | 0.92 E+5 | 9.7 | 770.0 | 39.0 |
| 256.bzip2 | 4.72 E+0 | 4.7 E+5 | 23.4 | 548.0 | 23.4 |
| 300.twolf | 4.74 E+0 | 0.60 E+5 | 93.8 | 97.0 | 20.4 |
| 168.wupwise | 5.68 E+0 | 28.82 E+5 | 2.8 | 695.0 | 5.0 |
| 171.swim | 9.02 E+0 | 3.2 E+5 | 99.7 | 4.0 | .4 |
| 172.mgrid | 42.55 E+0 | 9.0 E+5 | 75.2 | 4395.0 | 25. |
| 173.applu | 8.47 E+0 | 28.57 E+5 | 68.0 | 999.0 | 29.4 |
| 177.mesa | 5.85 E+0 | .37 E+5 | 56. | 799.0 | 32. |
| 178.galgel | 39.75 E+0 | 8.47 E+5 | 35.8 | 4397.0 | 32.4 |
| 179.art | 3.03 E+0 | 0.56 E+5 | 48.2 | 470.0 | 9.7 |
| 183.equake | 6.87 E+0 | 6.80 E+5 | 33.5 | 774.0 | 23. |
| 187.facerec | 4.62 E+0 | 2.98 E+5 | 63.7 | 623.0 | 25.5 |
| 188.ammp | 4.26 E+0 | 2.4 E+5 | 84.4 | 738.0 | 20.6 |
| 189.lucas | 5.07 E+0 | 25.95 E+5 | 63.9 | 685.0 | 26.4 |
| 191.fma3d | 6.04 E+0 | 7.06 E+5 | 82.6 | 854.0 | 22.4 |
| 200.sixtrack | 4.20 E+0 | 3.94 E+5 | 33.0 | 608.0 | 2. |
| 301.apsi | 8.69 E+0 | 3.28 E+5 | 02.2 | 228.0 | 2.7 |
| 400.perlbench | 2.99 E+0 | 52.3 E+5 | 222.3 | 2478.0 | . |
| 401.bzip2 | 9.73 E+0 | .98 E+5 | 29.4 | 05.0 | 8. |
| 403.gcc | 4.88 E+0 | 4.74 E+5 | 3.5 | 594.0 | 8.8 |
| 429.mcf | 2.6 E+0 | 37.36 E+5 | 338.3 | 3627.0 | 0.7 |
| 445.gobmk | 2.48 E+0 | 3.06 E+5 | 67.0 | 458.0 | 2.8 |
| 456.hmmer | 68.40 E+0 | 6.6 E+5 | 78.8 | 644.0 | 36.0 |
| 458.sjeng | 0.99 E+0 | 28.56 E+5 | 58.8 | 2906.0 | 24.9 |
| 462.libquantum | 5.47 E+0 | 20.99 E+5 | 633.3 | 4930.0 | 23.6 |
| 464.h264ref | 4.0 E+0 | 3.0 E+5 | 99.5 | 4075.0 | 40.9 |
| 471.omnetpp | 40.77 E+0 | 6.44 E+5 | 422.2 | 5797.0 | 3.7 |
| 473.astar | 29.57 E+0 | 35.99 E+5 | 229. | 3396.0 | 4.8 |
| 483.xalancbmk | 53.38 E+0 | 50.50 E+5 | 297.9 | 6943.0 | 23.3 |
| 410.bwaves | 90.5 E+0 | 45.03 E+5 | 555.4 | 855.0 | 33.4 |
| 416.gamess | 44.6 E+0 | 0.52 E+5 | 22.2 | 5682.0 | 73.9 |
| 433.milc | 5.48 E+0 | 3.33 E+5 | 565.8 | 5852.0 | 0.3 |
| 434.zeusmp | 85.62 E+0 | 8.2 E+5 | 593. | 9584.0 | 6.2 |
| 435.gromacs | 30.60 E+0 | 2.22 E+5 | 853.8 | 2883.0 | 5. |
| 436.cactusADM | 230.0 E+0 | 02.33 E+5 | 297.5 | 2475.0 | 9. |
| 437.leslie3d | 2.2 E+0 | 20.8 E+5 | 578.2 | 546.0 | 20.0 |
| 444.namd | 7.2 E+0 | 7.36 E+5 | 50.7 | 799.0 | 23.5 |
| 447.dealII | 09.73 E+0 | 88.48 E+5 | 47.3 | 2542.0 | 30. |
| 450.soplex | 20.73 E+0 | 76.33 E+5 | 236.3 | 229.0 | 9.4 |
| 453.povray | 67.9 E+0 | 0.40 E+5 | 227.9 | 734.0 | 3.3 |
| 454.calculix | 385.5 E+0 | 27.92 E+5 | 755.2 | 36728.0 | 48.6 |
| 459.GemsFDTD | 55.60 E+0 | 36.0 E+5 | 74.4 | 579.0 | 22.0 |
| 465.tonto | 46.23 E+0 | 6.45 E+5 | 640. | 5280.0 | 23.9 |
| 470.lbm | 79.73 E+0 | 67.03 E+5 | 430.8 | 8645.0 | 20. |
| 481.wrf | 200.09 E+0 | 6.40 E+5 | 866.4 | 249.0 | 24.7 |
| 482.sphinx3 | 33.24 E+0 | 6.43 E+5 | 679.6 | 4750.0 | 2.7 |

TABLE II: INDIVIDUAL STATISTICS OF THE 55 SPEC2000 AND SPEC2006 TEST PROGRAMS

| 5 SPEC2000 benchmarks | trace length | data size (64-byte lines) | avg-fp slowdown (x) | all-fp slowdown (x) |
|---|---|---|---|---|
| min | .4 E+0 | 0.3 E+5 | 8.8 | 248.2 |
| max | 6.05 E+0 | 32.55 E+5 | 39.7 | 360.5 |
| mean | 8.20 E+0 | 0.05 E+5 | 26. | 495.2 |

TABLE III: COMPARISON OF THE MIN, MAX, AND AVERAGE SLOWDOWNS BY AVERAGE-FOOTPRINT ANALYSIS AND BY ALL-FOOTPRINT ANALYSIS. ON AVERAGE, AVERAGE-FOOTPRINT ANALYSIS IS 38 TIMES FASTER.

| benchmark | inputs | min n (10^9) | max n (10^9) | min m (10^3 64-byte lines) | max m (10^3 64-byte lines) | min d_i | max d_i |
|---|---|---|---|---|---|---|---|
| 186.crafty | 3 | 0.94 | 4.40 | 29.85 | 29.94 | 0.0 | 0.05 |
| 188.ammp | 3 | 0.82 | 32.27 | 76.55 | 76.56 | 0.06 | 0.6 |
| 254.gap | 3 | 0.28 | 67.64 | 587.06 | 347.45 | 0.09 | 0.8 |
| 164.gzip | 3 | 0.80 | 5.9 | 2.33 | 247.83 | 0.07 | 0.20 |
| 177.mesa | 3 | 0.7 | 3.44 | 99.63 | 0.07 | 0. | 0.2 |
| 183.equake | 3 | 0.34 | 08.3 | 58.32 | 666.40 | 0. | 0.23 |
| 176.gcc | 3 | 0.26 | 6.24 | 57.68 | 2347.59 | 0.08 | 0.29 |
| 179.art | 3 | .02 | 26.99 | 22.20 | 44.86 | 0.0 | 0.32 |
| 197.parser | 3 | 0.93 | 78.04 | 97.69 | 406.0 | 0.0 | 0.33 |
| 256.bzip2 | 3 | 3.8 | 33.2 | 24.72 | 429.20 | 0.9 | 0.36 |
| 175.vpr | 3 | 0.30 | 33.74 | 2.82 | 32.74 | 0.33 | 0.44 |
| 181.mcf | 3 | 0.04 | 3.56 | 2.6 | 259.02 | 0.09 | 0.49 |
| 300.twolf | 3 | 0.08 | 05.90 | .4 | 46.66 | 0.09 | 0.5 |
| 444.namd | 3 | 3.66 | 7.8 | 734.98 | 735.52 | 0.0 | 0.02 |
| 435.gromacs | 3 | 2.85 | 306.00 | 22.82 | 222.2 | 0.02 | 0.05 |
| 470.lbm | 3 | 4.28 | 797.3 | 6702.79 | 6702.93 | 0.06 | 0.0 |
| 459.GemsFDTD | 3 | 5.49 | 556.05 | 0253.37 | 3609.65 | 0.04 | 0.3 |
| 453.povray | 3 | .75 | 679.2 | 39.45 | 39.64 | 0.03 | 0.4 |
| 454.calculix | 3 | 0.0 | 385.47 | 43.3 | 2792.43 | 0.09 | 0.6 |
| 482.sphinx3 | 3 | 3.27 | 332.42 | 455.39 | 643.02 | 0.08 | 0.2 |
| 464.h264ref | 5 | 7.70 | 2876.88 | 23.27 | 06.36 | 0. | 0.22 |
| 436.cactusADM | 3 | 7.83 | 2300.05 | 385.2 | 0232.62 | 0.07 | 0.23 |
| 434.zeusmp | 3 | 22.70 | 856.22 | 525.06 | 820.55 | 0.5 | 0.26 |
| 410.bwaves | 3 | 24.07 | 905.3 | 96.20 | 4502.83 | 0.04 | 0.27 |
| 429.mcf | 3 | 2.43 | 26.2 | 37.95 | 3736.28 | 0.7 | 0.28 |
| 458.sjeng | 3 | 7.43 | 09.88 | 2854.29 | 2855.60 | 0.0 | 0.29 |
| 401.bzip2 | 6 | .47 | 439.7 | 243.89 | 6855.6 | 0.06 | 0.38 |
| 400.perlbench | 88 | 0.00 | 549.55 | 3.90 | 9062.3 | 0.0 | 0.4 |
| 437.leslie3d | 3 | 30.05 | 2.8 | 227.96 | 208.42 | 0.5 | 0.42 |
| 403.gcc | | .80 | 05.74 | 47.77 | 5040.30 | 0.04 | 0.46 |
| 433.milc | 3 | 4.93 | 54.8 | 44.04 | 332.96 | 0.2 | 0.5 |
| 450.soplex | 5 | 0.04 | 209.8 | 9.44 | 7633.04 | 0.2 | 0.53 |
| 456.hmmer | 4 | 0.5 | 433.38 | 8.6 | 66.50 | 0.0 | 0.57 |
| 462.libquantum | 3 | 0.20 | 54.66 | 3.28 | 2098.58 | 0.32 | 0.59 |
| 465.tonto | 3 | 2.0 | 462.34 | 36.35 | 644.53 | 0.20 | 0.59 |
| 473.astar | 5 | 8.5 | 579.3 | 04.20 | 3599.33 | 0.25 | 0.66 |
| 416.gamess | 5 | .28 | 2305.93 | 48.2 | 00.05 | 0.08 | 0.8 |
| 471.omnetpp | 3 | .37 | 407.70 | 05.54 | 644.20 | 0.39 | 0.84 |
| 445.gobmk | 20 | 0.06 | 328.6 | 254.23 | 30.52 | 0.07 | .89 |

TABLE V: SIMILARITY OF THE FOOTPRINT IN DIFFERENT EXECUTIONS OF THE 37 SPEC2K/2006 BENCHMARKS AS MEASURED BY THE MAX AND MIN MANHATTAN DISTANCE (max d_i, min d_i)
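The distances d_i reported in Table V compare the footprint function of each input with the average over all inputs. The following minimal sketch shows one way to compute them under two assumptions: the per-window difference is divided by the average footprint (inferred from the statement that a distance of x% means a difference of x% in each window size), and the average footprint is obtained by brute force rather than by the linear-time algorithm of this paper. The function names and the two toy traces are hypothetical.

```python
def average_footprint(trace, w):
    """Average number of distinct items over all windows of length w.

    Brute force for illustration only; this is NOT the linear-time
    algorithm presented in the paper.
    """
    windows = len(trace) - w + 1
    total = sum(len(set(trace[i:i + w])) for i in range(windows))
    return total / windows


def manhattan_distances(footprints):
    """footprints[i][j] holds f_i(w_j) for execution i and window length w_j.

    Returns [d_0, ..., d_{k-1}], each the mean relative difference
    between f_i and the average footprint function (assumed normalization).
    """
    k = len(footprints)
    num_windows = len(footprints[0])
    mean = [sum(f[j] for f in footprints) / k for j in range(num_windows)]
    return [
        sum(abs(f[j] - mean[j]) / mean[j] for j in range(num_windows)) / num_windows
        for f in footprints
    ]


# Two toy traces standing in for executions with different inputs.
traces = ["abab", "abcd"]
window_lengths = [1, 2, 3]
footprints = [[average_footprint(t, w) for w in window_lengths] for t in traces]
distances = manhattan_distances(footprints)
```

For the two toy traces the footprint functions agree at window lengths 1 and 2 and differ only at length 3, so both distances are small and equal.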