Linear-time Modeling of Program Working Set in Shared Cache


Xiaoya Xiang, Bin Bao, Chen Ding, Yaoqing Gao
Computer Science Department, University of Rochester; IBM Toronto Software Lab

Abstract: Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all $O(n^2)$ windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

Keywords: Footprint, Cache sharing

I. INTRODUCTION

During a program execution, its working set can be defined as the footprint, which is the volume of data accessed in an execution window. Since the footprint shows the active data usage, it has been used to model resource sharing among concurrent tasks, to improve throughput, and to enforce fairness, either in memory sharing among multiprogrammed workloads or, more recently, in cache sharing among multicore workloads. A trace of n data accesses has $\binom{n+1}{2} = n(n+1)/2$ distinct windows (n - t + 1 windows of each length t from 1 to n) and therefore n(n+1)/2 footprints.

Early studies measured program footprints in the shared cache of time-sharing systems. Since applications interact between time quanta, it is sufficient to consider just the windows of a single length, the length of a scheduling quantum [20], [22]. On today's multicore systems, however, programs interact continuously.
A number of techniques were developed to estimate the footprint in all-length windows, but they did not guarantee the precision of the estimation [2], [3], [6], [8], [9], [2]. A recently published technique called all-window footprint analysis can measure all footprints in $O(CK \log M)$ time, where CK is linear in the length of the trace and M is the volume of data accessed in the trace [23]. For each window length, the analysis shows the maximum size, the minimum size, and the size distribution of footprints in all windows of this length [23]. The analysis is not fully accurate but guarantees a relative precision, e.g., 99%. We call the analysis all-footprint analysis, because it measures the size of every footprint.

In this paper, we present average-footprint analysis. For each window length, the analysis shows the average size of footprints in all windows of this length. While the analysis gives the accurate average, it does not measure the range or the distribution. However, a weaker analysis can often be done faster. Indeed, we show that the average footprint can be measured accurately in linear time, O(n) for a trace of length n, regardless of the data size.

The average footprint is a function mapping from the length of an execution window to the volume of its data access. Intuitively, the working set increases in larger execution windows. We prove that the average footprint is monotonically non-decreasing. The new analysis precisely quantifies the growth of the average footprint over time.

All-footprint analysis was previously the key metric used in composable models of cache sharing [3], [6], [2], [23]. For P programs, there are $2^P$ co-run combinations. A composable model makes $2^P$ predictions using P single-program runs rather than $2^P$ parallel runs. As an alternative to all-footprint analysis, the new average-footprint analysis can be used in the composable model to reduce the (footprint) measurement cost asymptotically.
To evaluate the speed and usefulness of average-footprint analysis, we test it on the complete suites of SPEC 2000 and SPEC 2006 CPU benchmarks and compare the results with the fastest all-footprint analysis [23]. To measure the accuracy of cache sharing prediction, we rank the slowdowns in two- and three-program co-runs on a quad-core machine and compare the predicted ranking with exhaustive testing. Through experiments, we show that average-footprint analysis can predict the effect of cache interference as accurately as all-footprint analysis, yet at only a fraction of its cost. In fact, the cost of all-footprint analysis is too high for it to model SPEC 2006 benchmarks, which have up to 1.9 trillion accesses to up to 1GB of data. In comparison, average-footprint analysis can model all SPEC 2006 benchmarks, finishing most of the programs within a few hours.

This study has two limitations. First, we are concerned with parallel workloads consisting only of sequential programs that do not share data. We do not consider parallel programs, although similar footprint metrics have been studied to model multithreaded workloads [6], [7]. Second, the footprint results are input specific, so they are useful mostly in workload characterization, for example, finding the most and the least interference among a set of benchmark programs.
II. BACKGROUND ON OFFLINE CACHE MODELS

Offline cache models do not improve performance directly but can be used to understand the causes of interference and to predict its effect before running the programs (so they may be grouped to reduce interference). Offline analysis measures the effect of all data accesses, not just cache misses. It characterizes a single program unperturbed by other programs and by the analysis itself. Such clean-room metrics avoid the chicken-and-egg problem when programs are analyzed together: the interference depends on the miss rate of co-running programs, but their miss rate in turn depends on the interference. Next we describe first the locality model of private cache and then the model for shared cache.

a) Reuse windows and the locality model: For each memory access, the temporal locality is determined by its reuse window, which includes all data accesses between this and the previous access to the same datum. Specifically, whether the access is a cache (capacity) miss depends on the reuse distance, the number of distinct data elements accessed in the reuse window. The relation between reuse distance and the miss rate has been well established [9], [6]. The capacity miss rate can be defined by a probability function involving the reuse distance and the cache size. Let the test program be A.

$$P(\text{capacity miss by A alone}) = P(\text{A's reuse distance} \ge \text{cache size})$$

b) Footprint windows and the cache sharing model: Offline cache sharing models were pioneered by Chandra et al. [3] and Suh et al. [2] for a group of independent programs and extended for multithreaded code by Ding and Chilimbi [6], Schuff et al. [7], and Jiang et al. [2]. Let A and B be two programs that share the same cache but do not share data. The effect of B on the locality of A is

$$P(\text{capacity miss by A when co-running with B}) = P\big((\text{A's reuse distance} + \text{B's footprint}) \ge \text{cache size}\big)$$

Given an execution window in a sequential trace, the footprint is the number of distinct elements accessed in the window. The examples in Figure 1 illustrate the interaction between locality and footprint. In the first example, a reuse window in program A concurs with a time window in program B, and the reuse distance of A is lengthened by the footprint of B's window. The second example uses two pairings of three traces to show that the shared cache miss rate depends also on the footprint, not just on the miss rate of the co-run threads.

An implication of the cache sharing model is that cache interference is asymmetric for programs with different locality and footprints. A program with large footprints and short reuse distances may disproportionately slow down other programs while experiencing little or no slowdown itself. This was observed in experiments [23], [24]. In one program pair, the first program shows a near 85% slowdown while the other program shows only a 5% slowdown.

[Figure 1. (a) In shared cache, the reuse distance (rd) in thread A is lengthened by the footprint (ft) of thread B. (b) Program A (xyyxxwxzzz) co-executed with program B (abbaadaccc) as axbybyaxaxdwaxczczcz incurs 4 cache misses on a 3-element fully associative LRU cache; co-executed with program B2 (abccbcccdd) as axbycycxbxcwcxczdzdz, it incurs 2.]
Fig. 1. Example illustrations of cache sharing. Programs B and B2 have the same miss rate. However, A and B incur 50% more misses in shared cache than A and B2. The difference is caused not by data reuse but by footprint.

III. THE MEASUREMENT OF AVERAGE FOOTPRINT

A. Definitions

Let W be the set of all n(n+1)/2 windows of a length-n trace. Each window $w = \langle t, v \rangle$ has a length t and a footprint v. Let I(p) be a boolean function returning 1 when p is true and 0 otherwise. The footprint function $\overline{fp}(t)$ averages over all windows of length t:

$$\overline{fp}(t) = \frac{\sum_{w_i \in W} v_i \, I(t_i = t)}{\sum_{w_i \in W} I(t_i = t)} = \frac{\sum_{w_i \in W} v_i \, I(t_i = t)}{n - t + 1}$$

For example, the trace abbb has 3 windows of length 2: ab, bb, and bb. The corresponding footprints are 2, 1, and 1, so $\overline{fp}(2) = (2 + 1 + 1)/3 = 4/3$.

B. O(n) Algorithm

There is a linear-time algorithm that calculates the precise average footprint for all execution windows of a trace. Let n and m be the length of the trace and the number of distinct data used in the trace. The algorithm first measures the following three quantities:
- the distribution of the time distances of all data reuses (n - m distances)
- the first-access times of all distinct data (m access times)
- the last-access times of all distinct data (exact definition later, m access times)

The three quantities can be measured by a single pass over the trace using a hash table with one entry for each distinct datum. The cost is linear: O(n) in time and O(m) in space.
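To make the pipeline concrete, here is a minimal Python sketch, not the authors' tool. The function names are illustrative, a dict stands in for the hash table, and positions are 1-based. The "last-access time" is taken to be the reverse last-access time l = n + 1 - x defined later in this section. Anticipating the closed form derived below as Equation 7, the sketch turns the three quantities into the exact average footprint for every window length. Merging the three sums into one histogram and sweeping it with suffix sums is our own simplification. A brute-force version of the definition above serves as a reference.

```python
from collections import Counter
from fractions import Fraction

# Illustrative sketch; names are ours, not the paper's.


def profile(trace):
    """One pass over the trace, with a dict playing the role of the hash table.

    Returns n, the first-access times f_i, the reverse last-access times
    l_i = n + 1 - x (x = position of the last access), and the reuse-time
    histogram r_t. Positions are 1-based.
    """
    n = len(trace)
    first_seen = {}    # datum -> position of its first access
    prev = {}          # datum -> position of its most recent access
    reuse = Counter()  # reuse time distance t -> r_t
    for pos, x in enumerate(trace, start=1):
        if x in prev:
            reuse[pos - prev[x]] += 1
        else:
            first_seen[x] = pos
        prev[x] = pos
    first = list(first_seen.values())
    last_rev = [n + 1 - p for p in prev.values()]
    return n, first, last_rev, reuse


def average_footprints(trace):
    """Exact average footprint for every window length w = 1..n.

    Closed form of Equation 7:  fp(w) = m - S(w) / (n - w + 1), where
    S(w) = sum over v > w of (v - w) * h[v] and h merges the histograms of
    f_i, l_i and reuse times. Suffix sums make the sweep over w O(n).
    """
    n, first, last_rev, reuse = profile(trace)
    m = len(first)
    h = [0] * (n + 1)  # every f_i, l_i and t lies in 1..n
    for v in first:
        h[v] += 1
    for v in last_rev:
        h[v] += 1
    for t, r in reuse.items():
        h[t] += r
    avg = {}
    count_above = 0   # sum of h[v] over v > w
    weight_above = 0  # sum of v * h[v] over v > w
    for w in range(n, 0, -1):
        s = weight_above - w * count_above
        avg[w] = m - Fraction(s, n - w + 1)  # exact rational result
        count_above += h[w]
        weight_above += w * h[w]
    return avg


def brute_force(trace, w):
    """Reference: average footprint straight from the definition, O(n * w)."""
    n = len(trace)
    total = sum(len(set(trace[s:s + w])) for s in range(n - w + 1))
    return Fraction(total, n - w + 1)


print(profile("abbb"))                # (4, [1, 2], [4, 1], Counter({1: 2}))
print(average_footprints("abbb")[2])  # 4/3, as in the example above
```

Exact fractions are used so that the closed form can be compared with the brute-force definition without rounding noise.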
The three measures are the inputs to a formula $\overline{fp}(w)$. For any window size w ($0 < w \le n$), $\overline{fp}(w)$ computes the average footprint for all windows of size w. In other words, the formula computes the average footprint for windows of all sizes without having to inspect the trace again. In the rest of the section, we derive the formula and discuss its complexity.

The main idea of the formula is differential counting, which counts the difference in the footprint between consecutive windows. For any window size w, we start with the footprint in the first window and then compute its increase or decrease as the window moves forward in the trace. The first-access times are sufficient to compute the footprint of the first window. The change in later windows depends on two metrics on each trace element $d_i$: the forward time distance $fwd(d_i)$ and the backward time distance $bwd(d_i)$. Let datum x be accessed at $d_i$, and let the closest accesses of x be $d_j$ before $d_i$ and $d_k$ after $d_i$. Then $fwd(d_i) = k - i$ and $bwd(d_i) = i - j$. The forward and backward time distances determine the change of footprint between consecutive windows. The relation is shown in Figure 2.

[Figure 2: the windows fp(i-w), fp(i-w+1), fp(i), and fp(i+1) around the trace element $d_i$, bounded by $d_{i-w}$, $d_{i-w+1}$, $d_{i+1}$, and $d_{i+w}$, with $bwd(d_i)$ and $fwd(d_i)$ marked.]
Fig. 2. An illustration of how the forward and backward (reuse) time distances influence the change in footprint between consecutive windows

Let the footprint of a w-size window starting at i be fp(i). Each element $d_i$ in the trace affects the footprint of w windows: fp(i-w+1), fp(i-w+2), ..., fp(i). In differential counting, we consider only the effect of $d_i$ on two pairs of windows: the change from fp(i-w) to fp(i-w+1) when $d_i$ enters its first window, and the change from fp(i) to fp(i+1) when $d_i$ exits its last window of influence. Figure 2 shows $d_i$ and the two pairs of windows, where $d_i$ enters between the first pair and exits between the second pair.
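Both time distances come out of a single pass that remembers, for each datum, the position of its latest access. The following is a minimal sketch with an illustrative function name. It uses 1-based positions and math.inf when there is no earlier or later access to the same datum, which is the convention the notation below adopts for first and last accesses.

```python
import math

# Illustrative sketch; the function name is ours, not the paper's.


def time_distances(trace):
    """Backward and forward reuse time distances of every access.

    Returns two lists indexed 1..n (index 0 unused): bwd[i] = i - j and
    fwd[i] = k - i, where d_j and d_k are the closest earlier and later
    accesses to the same datum; math.inf when no such access exists.
    """
    n = len(trace)
    bwd = [math.inf] * (n + 1)
    fwd = [math.inf] * (n + 1)
    last_pos = {}  # datum -> position of its latest access so far
    for i, x in enumerate(trace, start=1):
        if x in last_pos:
            j = last_pos[x]
            bwd[i] = i - j  # the reuse seen from the later access d_i
            fwd[j] = i - j  # the same reuse seen from the earlier access d_j
        last_pos[x] = i
    return bwd, fwd


bwd, fwd = time_distances("abcab")
print(bwd[1:])  # [inf, inf, inf, 3, 3]
print(fwd[1:])  # [3, 3, inf, inf, inf]
```

Every reuse is recorded twice, once as the backward distance of the later access and once as the forward distance of the earlier one. This pairing is what lets the entering and exiting effects be counted separately.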
When $d_i$ enters, it does not increase the footprint relative to fp(i-w) if the same datum was previously accessed within fp(i-w), which means that its backward time distance is no greater than w ($bwd(d_i) \le w$). This is the case illustrated in Figure 2. Otherwise, $d_i$ adds 1 to the footprint. Similarly, when $d_i$ exits from fp(i), the departure does not change fp(i+1) if $fwd(d_i) \le w$; otherwise, it subtracts 1 from fp(i+1), as in the case illustrated in Figure 2. The footprint fp(i+1) depends on three factors: the footprint fp(i), the contribution of the entering $d_{i+w}$, and the detraction of the exiting $d_i$. The footprint of all windows is then computed by adding these differences.

Next we formulate this computation. We use the following notation:
- n, m, w: the length of the trace, the size of data, and the window size of interest
- $d_i$: the ith trace access
- fp(i): the footprint of the window from $d_i$ to $d_{i+w-1}$ (including $d_i$ and $d_{i+w-1}$)
- $bwd(d_i)$: the backward reuse time distance of $d_i$; $\infty$ if $d_i$ is the first access
- $fwd(d_i)$: the forward reuse time distance of $d_i$; $\infty$ if $d_i$ is the last access
- I(p): a boolean function that returns 1 if p is true and 0 otherwise. For example, $I(bwd(d_i) > w)$ gives the contribution by $d_i$, which is 1 if $bwd(d_i) > w$ and 0 otherwise. Similarly, $I(fwd(d_i) > w)$ gives the detraction of $d_i$, which is 1 if $fwd(d_i) > w$ and 0 otherwise.

The total size of the footprints in all windows of length w, when divided by the number of windows n - w + 1, is the average footprint:

$$\overline{fp}(w) = \frac{1}{n-w+1} \sum_{i=1}^{n-w+1} fp(i) \tag{1}$$

Each window's footprint follows from that of the previous window:

$$fp(i+1) = fp(i) + I(bwd(d_{i+w}) > w) - I(fwd(d_i) > w) \tag{2}$$

A time distance of exactly w counts as no change on either side. This is consistent because $bwd(d_{i+w}) = w$ holds exactly when $fwd(d_i) = w$, that is, when the entering and the exiting elements access the same datum, so their two effects cancel.

Expanding Equation 1 using Equation 2, we have three components in the average footprint:

$$\overline{fp}(w) = fp(1) + \frac{1}{n-w+1} \Big( \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) \Big) \tag{3}$$

Next we compute each component separately.
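Equations 2 and 3 can be checked mechanically. The sketch below is illustrative: the names are ours, and windows hold w accesses, from d_i to d_{i+w-1}. It rolls the window forward with the update of Equation 2, starting from the first-window footprint that Equation 4 states next. It also evaluates Equation 3 directly as weighted counts of entering and exiting accesses. Both are compared against footprints counted window by window.

```python
import math
from fractions import Fraction

# Illustrative sketch; names are ours, not the paper's.


def time_distances(trace):
    # bwd[i], fwd[i] for 1-based positions; math.inf when there is no reuse
    n = len(trace)
    bwd, fwd, last_pos = [math.inf] * (n + 1), [math.inf] * (n + 1), {}
    for i, x in enumerate(trace, start=1):
        if x in last_pos:
            bwd[i] = fwd[last_pos[x]] = i - last_pos[x]
        last_pos[x] = i
    return bwd, fwd


def window_footprints(trace, w):
    """fp(1), ..., fp(n-w+1) for windows d_i..d_{i+w-1}: start from the
    first window and roll forward with the update of Equation 2."""
    n = len(trace)
    bwd, fwd = time_distances(trace)
    fp = [sum(1 for i in range(1, w + 1) if bwd[i] == math.inf)]  # fp(1)
    for i in range(1, n - w + 1):
        enter = 1 if bwd[i + w] > w else 0  # d_{i+w} brings a new datum
        leave = 1 if fwd[i] > w else 0      # d_i's datum is gone from fp(i+1)
        fp.append(fp[-1] + enter - leave)
    return fp


def average_by_eq3(trace, w):
    """Equation 3: fp(1) plus weighted counts of entering and exiting
    accesses, divided by the number of windows n - w + 1."""
    n = len(trace)
    bwd, fwd = time_distances(trace)
    fp1 = sum(1 for i in range(1, w + 1) if bwd[i] == math.inf)
    enter = sum(n - i + 1 for i in range(w + 1, n + 1) if bwd[i] > w)
    leave = sum(n - i + 1 - w for i in range(1, n - w + 1) if fwd[i] > w)
    return fp1 + Fraction(enter - leave, n - w + 1)


trace = "abcabbdca"
for w in range(1, len(trace) + 1):
    direct = [len(set(trace[s:s + w])) for s in range(len(trace) - w + 1)]
    assert window_footprints(trace, w) == direct
    assert average_by_eq3(trace, w) == Fraction(sum(direct), len(direct))
print(average_by_eq3("abbb", 2))  # 4/3
```

The weights in Equation 3 are simply the number of later windows that inherit each change. An entering access at position i affects n - i + 1 window averages, and an exiting access at position i affects n - i + 1 - w of them.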
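The footprint of the first window of length w is

$$fp(1) = \sum_{i=1}^{w} I(bwd(d_i) = \infty) \tag{4}$$

In the next component, we split the backward time distances into two groups: finite and infinite distances. The summation range of the finite distances can be changed to run from 1 to n instead of from w + 1 to n, because no access at a position $i \le w$ can have a finite backward distance greater than w:

$$\begin{aligned}
\sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w)
&= \sum_{i=w+1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty) \\
&= \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty)
\end{aligned} \tag{5}$$

Similarly, we decompose and simplify the forward distances: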
$$\sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) = \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) + \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) \tag{6}$$

Combining Equations 4, 5, and 6, we can now expand Equation 3. Instead of using individual accesses, we now use the three inputs, defined as follows:
- $f_i$: the first-access time of the ith datum
- $l_i$: the reverse last-access time of the ith datum. If the last access is at position x, then $l_i = n + 1 - x$, that is, the first-access time in the reverse trace.
- $r_t$: the number of accesses with a reuse time distance t

$$\begin{aligned}
\overline{fp}(w) &= \sum_{i=1}^{n} I(bwd(d_i) = \infty) + \frac{1}{n-w+1} \Big( \sum_{i=w+1}^{n} (w-i)\, I(bwd(d_i) = \infty) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) \\
&\qquad + \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) - \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) \Big) \\
&= m + \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (w-f_i)\, I(f_i > w) + \sum_{i=1}^{m} (w-l_i)\, I(l_i > w) + \sum_{t=1}^{n} (w-t)\, I(t > w)\, r_t \Big) \\
&= m - \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (f_i-w)\, I(f_i > w) + \sum_{i=1}^{m} (l_i-w)\, I(l_i > w) + \sum_{t=w+1}^{n} (t-w)\, r_t \Big)
\end{aligned} \tag{7}$$

Equation 7 passes the sanity check that the average footprint $\overline{fp}(w)$ is at most the data size m, and that the footprint of the whole trace (w = n) is m. Fixing the window length w and ignoring the effect of first and last accesses, we see that the average footprint decreases if more reuse time distances ($r_t$) have larger values (t). This suggests that improving locality reduces the average footprint. For example, if we double the length of a trace by repeating each element twice, the long time distances would double in length, and the average footprint would drop.

For each window length w, Equation 7 can be computed in time O(w). If we consider only window sizes on a logarithmic scale, the formula can be represented and evaluated in O(log w) time.

C. Monotonicity

Theorem 3.1: The average footprint $\overline{fp}(w)$ is non-decreasing.

Proof: Let $w_i^k$ denote the ith window whose size is k, and let $f(w_i^k)$ denote its footprint. We prove that $\forall k, 0 < k < n$, $\overline{fp}(k+1) \ge \overline{fp}(k)$.
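First, $\forall i, 0 < i \le n-k$, the following holds because $w_i^k$ and $w_{i+1}^k$ are both contained in $w_i^{k+1}$:

$$f(w_i^{k+1}) \ge f(w_i^k), \qquad f(w_i^{k+1}) \ge f(w_{i+1}^k)$$

In addition, for every k, $0 < k \le n$, there exists j, $0 < j \le n-k+1$, such that $f(w_j^k) \le \overline{fp}(k)$, since at least one window is no larger than the average. Then

$$\begin{aligned}
\overline{fp}(k+1) &= \frac{1}{n-k} \sum_{i=1}^{n-k} f(w_i^{k+1}) = \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k+1}) + \sum_{i=j}^{n-k} f(w_i^{k+1}) \Big] \\
&\ge \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j}^{n-k} f(w_{i+1}^{k}) \Big] = \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j+1}^{n-k+1} f(w_i^{k}) \Big] \\
&= \frac{1}{n-k} \Big[ \sum_{i=1}^{n-k+1} f(w_i^{k}) - f(w_j^{k}) \Big] = \frac{1}{n-k} \Big[ (n-k+1)\, \overline{fp}(k) - f(w_j^{k}) \Big] \\
&= \overline{fp}(k) + \frac{1}{n-k} \Big[ \overline{fp}(k) - f(w_j^{k}) \Big] \ge \overline{fp}(k)
\end{aligned}$$

IV. AVERAGE FOOTPRINT IN THE COMPOSABLE MODEL

Our previous work used all-footprint analysis in the composable model to predict cache interference [23]. In the composable model, when multiple programs are run together, each reuse distance in a program is lengthened by the aggregate footprint of all peer programs over the same time window. Suppose there are n programs $t_1, t_2, \ldots, t_n$ running on a shared cache. The miss rate is computed by

$$P(\text{capacity miss by } t_i \text{ running with } t_j,\ j = 1, \ldots, n,\ j \ne i) = P\Big(\big(t_i\text{'s reuse distance} + \sum_{j \ne i} t_j\text{'s footprint}\big) \ge \text{cache size}\Big)$$

Suppose the distribution of program $t_i$'s reuse distance is $D_{rd}(t_i)$, and the distribution of program $t_i$'s footprint of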
window size w is $D_{fp}(t_i, w)$. The first distribution is defined as

$$D_{rd}(t_i) = \{\langle x_k^i, p_k^i \rangle \mid \textstyle\sum_k p_k^i = 1\}$$

where $\langle x_k^i, p_k^i \rangle$ means that the probability of the reuse distance being $x_k^i$ is $p_k^i$. Similarly, we define

$$D_{fp}(t_i, w) = \{\langle y_k^{w,i}, q_k^{w,i} \rangle \mid \textstyle\sum_k q_k^{w,i} = 1\}$$

Given a window size w, we use $\langle y_k^{w,i}, q_k^{w,i} \rangle$ to mean that the probability that the footprint equals $y_k^{w,i}$ is $q_k^{w,i}$.

Consider a 2-program co-run involving $t_1$ and $t_2$. The capacity miss rate of $t_1$ is calculated by Equation 8:

$$mr(t_1) = \sum_{k_1} \sum_{k_2} p_{k_1}^{1}\, q_{k_2}^{w(x_{k_1}^{1}),2}\, I\big(x_{k_1}^{1} + y_{k_2}^{w(x_{k_1}^{1}),2} \ge C\big) \tag{8}$$

where I is the indicator function defined in Section III, C is the cache size, and $w(x_k)$ is the size of the reuse window that contains the reuse distance $x_k$. This is the equation employed by all-footprint-based modeling [23]. To use average-footprint analysis instead, we define the average footprint of window size w for program $t_i$ as $F(t_i, w) = f_i^w$. Equation 8 then simplifies to Equation 9:

$$mr(t_1) = \sum_{k} p_k^{1}\, I\big(x_k^{1} + f_2^{w(x_k^{1})} \ge C\big) \tag{9}$$

The estimation of the execution time from the miss rate is the same as in [23]. The only difference is that the previous model uses all-footprint analysis and Equation 8, whereas the new model uses average-footprint analysis and the simpler Equation 9.

V. EVALUATION

A. Experimental Setup

We have implemented the average-footprint analysis algorithm in a profiling tool and tested 26 SPEC2K benchmarks, 12 integer and 14 floating-point, and 29 SPEC2006 benchmarks, 12 integer and 17 floating-point. All benchmarks are instrumented by Pin [5] and profiled on a machine with an Intel Core i5-660 processor and 4GB physical memory. The machine is set up with Fedora 13 and GCC. The two-program co-run results for SPEC 2000 are collected on an Intel Core 2 Duo machine with two 2.0GHz cores sharing a 2MB L2 cache and 2GB memory. To measure 3-program co-runs, we use an Intel quad-core machine with four 2.27GHz cores sharing an 8MB L3 cache and 8GB memory. Except in Section V-G, where we examine the effect of input, we use the reference input in the test.
Some programs, especially in SPEC 2006, have multiple reference inputs. We use the first one tested by the auto-runner. In performance comparisons, the base program run time is the time without Pin instrumentation or any other analysis.

The length of SPEC 2000 traces ranges from 4 billion in gcc to 425 billion in mgrid. The amount of data ranges from 3 thousand 64-byte cache blocks (MB) in eon to 3.2 million cache blocks (256MB) in gcc. The SPEC 2006 traces on average are 10 times as long as SPEC 2000 traces and have 5 times as many cache blocks. The trace of bwaves is the longest, with 1.9 trillion data accesses, and has the most data, 928MB. The individual statistics of the 55 programs are listed in Table II.

To evaluate cache-sharing predictions, we run two experiments:
- 2-program co-runs. We predict all 2-program co-runs and compare the predicted ranking with that of the previous work on the same 15 SPEC 2000 benchmarks [23].
- 3-program co-runs. We started with the 10 representative benchmarks in SPEC2006 selected by Zhuravlev et al. [29]. Reuse-distance analysis was too slow to measure 2 of these programs. We evaluate the prediction for all program triples of the remaining 8 programs.

In both tests, we also compare with a simple prediction method based on miss rates (ranking by the total miss rate of the programs in the co-run group) [23].

B. Efficiency of Average-footprint Analysis

Table I summarizes the analysis cost for the two benchmark suites and, for each suite, the average for integer and for floating-point programs. It divides the 55 tests into four groups: 12 SPEC 2000 integer programs, 14 SPEC 2000 floating-point programs, 12 SPEC 2006 integer programs, and 17 SPEC 2006 floating-point programs. The result of each group is summarized in three rows and three columns. The columns show the trace length, the data size, and the slowdown ratio of the profiling time to the unmodified run time.
The rows show the minimum, maximum, and average slowdown factors for all benchmarks of the group. The minimum slowdowns in all four benchmark groups are below 10. The maximum slowdowns are 40, 32, 4, and 74. The average slowdowns are between 2 in SPEC 2006 integer tests and 29 in SPEC 2000 integer tests. On average across all four groups, average-footprint analysis takes no more than 30 times the original execution time.

The individual results of the 55 programs are shown in Table II. Compared to the summary table, the individual-result table has two additional columns, which show the unmodified execution time and the time of average-footprint analysis. The unmodified time measures the execution of the original program without any instrumentation or analysis. On average, an unmodified SPEC 2000 program takes less than 3 minutes, and an unmodified SPEC 2006 program takes close to 10 minutes. Average-footprint analysis takes 3 to 73 minutes for SPEC 2000 programs and from 10 minutes (gcc) to 10 hours (calculix) for SPEC 2006 programs.

C. Comparison with All-footprint Analysis

All-footprint analysis can analyze SPEC 2000 programs but not SPEC 2006 programs. We compare average- and all-footprint analysis on SPEC 2000 programs in Table III. SPEC 2000 has 26 programs in total. The paper on all-footprint analysis reported results for 15 of the programs [23]. The last two columns of the table summarize the cost of the two analyses in these 15 tests. The slowdowns by average-footprint analysis are between 8.8 and 40. The slowdowns by all-footprint analysis are between 248 and 360. The average slowdown is 40 for average-footprint analysis and about 500 for all-footprint analysis. In other words, on average for these 15 programs, average-footprint analysis is 38 times faster than all-footprint analysis.

All-footprint analysis takes too long for SPEC 2006 programs. For example, average-footprint analysis takes 10 hours to profile calculix. Being 38 times slower, all-footprint analysis would take more than two weeks to measure the footprint distribution.

[Table I: for each group (SPEC2000 INT, SPEC2000 FP, SPEC2006 INT, SPEC2006 FP), the min, max, and mean of the trace length, the data size (64-byte lines), and the average-footprint slowdown (x).]
TABLE I. THE MIN, MAX, AND AVERAGE COSTS (SLOWDOWNS) OF AVERAGE-FOOTPRINT ANALYSIS FOR 55 SPEC 2000 AND SPEC 2006 BENCHMARK PROGRAMS

D. Two-Program Co-run Ranking

The prior work showed co-run ranking results for 15 SPEC 2000 programs based on all-footprint analysis and compared them with miss-rate-based ranking and exhaustive testing [23]. We now show the ranking results using average-footprint analysis and compare them with the three previous ranking methods.

We show the prediction results in a 2D plot. The x-axis is the rank of program co-run groups. In this test, the rank ranges from 1 (the least interfering pair) to 105 (the most interfering pair). The y-axis shows the interference, measured by the quadratic mean of the slowdowns of programs in the co-run group. The slowdown of a co-run program is the ratio of its co-run time to its time running alone on the same machine (cache). For two programs with slowdowns $s_1$ and $s_2$, we have $y = \sqrt{(s_1^2 + s_2^2)/2}$.

The three graphs in Figure 3 show the plots for the predictions based on miss rate, all-footprint analysis, and average-footprint analysis. In each plot, the accurate result from exhaustive testing is shown by a monotonically increasing red line as a reference.
The simple miss-rate-based prediction does not show an increasing trend, suggesting no correlation between the prediction and the actual interference. The two footprint-based predictions show significant correlation: programs predicted to have a high interference tend to actually have a high interference.

Average-footprint analysis ranks several program pairs better than all-footprint analysis. Consider the pair with the highest interference, (art, mcf), with a slowdown of 2. The pair is ranked 23 by miss rate, 99 by all-footprint analysis, and 105 by average-footprint analysis. The average-footprint rank is precisely correct. All-footprint ranking has a significant misprediction for the program pair (gcc, art). The two programs slow each other down by 1.6 times. The pair should be ranked 97 but is ranked 44 by all-footprint analysis, which is worse than the miss-rate rank of 70. The rank by average-footprint analysis, 86, is the best of the three.

E. Three-Program Co-run Ranking

Evaluating larger co-run groups is difficult because the number of tests increases exponentially with the size of the co-run group. To test all 3-program co-runs in SPEC 2006 benchmarks, we would have to run $\binom{29}{3} = 3654$ tests. Even if we ran all the tests, it would be impossible to show the results clearly. Fortunately, Zhuravlev et al. have analyzed the benchmark set based on cache miss rates and access rates and identified 10 representatives [29]. We had to narrow down further because the reuse-distance analysis could finish for only 8 of the 10 representatives: 403.gcc, 416.gamess, 429.mcf, 444.namd, 445.gobmk, 450.soplex, 453.povray, and 470.lbm. There are 56 different 3-program groups from these 8 benchmarks. We show the prediction results in Figure 4. The results for 3-program co-runs of SPEC 2006 programs are similar to those of 2-program co-runs of SPEC 2000 programs.
As before, the miss-rate-based prediction does not show a detectable correlation, while average-footprint analysis shows a clear correlation with the actual interference. The maximal slowdown increases from 2.0 in 2-program co-runs to 3.3 in 3-program co-runs, confirming the expectation that the interference becomes worse as the cache is shared by more programs. Exhaustive testing is also increasingly infeasible. For both reasons, the composable model is more valuable, and so is the higher efficiency of average-footprint analysis.

F. Rank and Performance Closeness

To quantify the difference between the predicted ranking and the accurate ranking, we define two metrics: the rank closeness and the performance closeness. The rank closeness shows on average how much the predicted rank of a co-run group differs from its actual rank. We number the n co-run groups by their accurate rank i. Let pred(i) be the predicted rank for group i. The rank closeness is defined as

$$\text{rank closeness} = \frac{1}{n} \sum_{i=1}^{n} |pred(i) - i|$$
The formula is the Manhattan distance between the two vectors $\langle pred(1), pred(2), \ldots, pred(n) \rangle$ and $\langle 1, 2, \ldots, n \rangle$, divided by n. The worst possible ranking (for example, the reversed order) has a rank-closeness score of $\lfloor n^2/2 \rfloor / n$, that is, n/2 if n is even or $(n^2 - 1)/(2n)$ otherwise.

Next we quantify the error in terms of the mispredicted slowdown. Let f(i) be the slowdown of the co-run group with the accurate rank i, and f(pred(i)) be the slowdown of the co-run group whose accurate rank is pred(i). The difference |f(pred(i)) - f(i)| gives the misprediction. The performance closeness is the average misprediction for all groups:

$$\text{performance closeness} = \frac{1}{n} \sum_{i=1}^{n} |f(pred(i)) - f(i)|$$

The two metrics are shown in Table IV. On average for 2-program co-runs, the miss-rate rank errs by 35 positions, while the footprint-based ranks err by 4 and 5 positions. For 3-program co-runs, the miss-rate rank errs by 9 positions, while the average-footprint rank errs by 6. In terms of performance, the miss-rate-based ranking mispredicts twice as badly as the footprint-based ranking.

[Figure 3: three panels plotting slowdown against ranked program pairs (from least interference to most interference), comparing the miss-rate, all-footprint, and average-footprint predictions with exhaustive testing.]
Fig. 3. Evaluation of 2-program co-run predictions for 15 SPEC2000 benchmark programs. The prediction quality of average-footprint analysis is similar to that of all-footprint analysis.

[Figure 4: two panels plotting slowdown against ranked program triples (from least interference to most interference), comparing the miss-rate and average-footprint predictions with exhaustive testing.]
Fig. 4. Comparison of 3-program co-run predictions for 8 SPEC2006 benchmarks. All-footprint analysis cannot model these programs because of its high cost.

In search of a closeness metric, we also measured the Levenshtein distance.
For two permutations of a set of numbers, the Levenshtein distance measures the number of edits needed to convert one into the other. For the 2-program co-run test, the distance is 103 for miss rate, 97 for all-footprint, and 96 for average-footprint. For the 3-program co-run test, the distance is 54 for miss rate and 48 for average-footprint. Levenshtein is not a good metric, since it does not distinguish a ranking that shows no correlation from rankings that do.
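The quadratic-mean slowdown used as the interference measure in Section V-D and the two closeness metrics of Section V-F take only a few lines. The sketch below is illustrative. The function names and the toy numbers are hypothetical. pred is given as a list with pred[i - 1] = pred(i), and the slowdowns as a list indexed by accurate rank. The final loop checks the worst-case rank closeness of floor(n^2/2)/n by brute force over all rankings for small n.

```python
import math
from itertools import permutations

# Illustrative sketch; names and example numbers are hypothetical.


def quadratic_mean(slowdowns):
    """Interference of a co-run group: root mean square of its slowdowns."""
    return math.sqrt(sum(s * s for s in slowdowns) / len(slowdowns))


def rank_closeness(pred):
    """pred[i - 1] = pred(i), the predicted rank of the group whose
    accurate rank is i; returns (1/n) * sum |pred(i) - i|."""
    n = len(pred)
    return sum(abs(p - i) for i, p in enumerate(pred, start=1)) / n


def performance_closeness(pred, slowdown):
    """slowdown[r - 1] = f(r), the measured slowdown of the group whose
    accurate rank is r; returns (1/n) * sum |f(pred(i)) - f(i)|."""
    n = len(pred)
    return sum(abs(slowdown[p - 1] - slowdown[i - 1])
               for i, p in enumerate(pred, start=1)) / n


# Hypothetical 4-group example: the two least-interfering groups are swapped.
pred = [2, 1, 3, 4]
f = [1.0, 1.5, 2.0, 2.5]
print(rank_closeness(pred))            # 0.5
print(performance_closeness(pred, f))  # 0.25
print(quadratic_mean([1.0, 7.0]))      # 5.0

# Worst-case rank closeness, by brute force over all rankings of n groups.
for n in range(2, 7):
    worst = max(rank_closeness(p) for p in permutations(range(1, n + 1)))
    print(n, worst, (n * n // 2) / n)
```

Unlike the Levenshtein distance, both metrics weight an error by how far it is off, in rank positions or in slowdown. This is why they separate a correlated ranking from an uncorrelated one.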
[Table IV: rank closeness and performance closeness of each ranking strategy, for 2-program co-runs over 15 SPEC2000 benchmarks (miss rate, all-footprint, average-footprint) and for 3-program co-runs over 10 SPEC2006 benchmarks (miss rate, average-footprint).]
TABLE IV. COMPARISON OF DIFFERENT RANKING STRATEGIES

G. The Effect of Input on Average Footprint

The footprint of a program execution is affected by the program input, just as the length of the execution is. An important question for profiling-based techniques is how much the footprints in training runs may differ from those in test runs. In this section, we give a preliminary measure of this difference.

Given a set of k executions of the same program, we quantify the variation between the k footprint functions $f_i(w)$ as follows. First, we compute the average of the average footprints: $\bar{f}(w) = \frac{1}{k} \sum_{i=1}^{k} f_i(w)$. Then we compute the Manhattan distance between the ith execution and the average as

$$d_i = \frac{1}{W} \sum_{j=1}^{W} \frac{|f_i(w_j) - \bar{f}(w_j)|}{\bar{f}(w_j)}$$

where W is the number of different window lengths. A Manhattan distance of x% means that, on average, the footprint function of input i differs from the average by x% in each window size.

Table V shows the SPEC2000 and SPEC2006 programs, the number of inputs (provided by the benchmark suite and tested in our experiments), the range of trace lengths and data sizes in these inputs, and the smallest and largest Manhattan distances as just defined. The majority of programs, 20 out of 37, see no more than 30% difference between the footprints of different inputs. The minimal difference is less than 20% in all but 5 programs. Note that the effect of the input may be predicted using model fitting based on the input characteristics [26]. This is outside the scope of this paper.

VI. RELATED WORK

Locality models: Locality in private cache can be modeled by reuse distance, which can be measured with a guaranteed precision in time O(n log 2 m), where n is the length of the trace and m is the size of data [26].
Reuse distance has found many uses in workload characterization and program optimization [26]. There are a number of recent developments. Chauhan and Shei gave a method for static analysis of locality in MATLAB code [4]. Unlike profiling, whose results are usually input-specific, static analysis can identify and model the effect of program parameters. Most previous models targeted program analysis. Ibrahim and Strohmaier used synthetic probes to emulate the locality of an application for efficient machine characterization [10]. Zhou studied the random cache replacement policy and gave a one-pass deterministic trace-analysis algorithm to compute the average miss rate (instead of simulating many times and taking the average) [28]. Finally, Schuff et al. defined multicore reuse distance analysis and improved its efficiency through sampling and parallelization [17]. The sampling was based on a method developed earlier by Zhong and Chang [25]. These techniques are concerned only with reuse windows and cannot measure the footprint in all execution windows, which is the problem addressed in this paper.

Offline cache sharing models: The average working-set size in single-length execution windows, such as a scheduling quantum, can be computed in linear time. It has been used in studying multiprogrammed systems [7], [22]. In a parallel environment such as today's multicore processors, programs interact constantly. The interference in all-length windows has been considered for memory [21] and for cache [3]. Both used a recursive equation involving the working set and the miss rate: as a window of size w is extended to w + 1, the change in the working set depends on whether the new access is a miss. Suh et al. assumed linear function growth when window sizes were close [21]. Chandra et al. computed the recursive relation bottom-up [3]. The same problem has been solved using statistical inference. Two techniques by Berg and Hagersten (known as StatCache) [2] and by Shen et al.
[19] were used to infer the cache miss rate from the distribution of reuse times. Berg and Hagersten assumed a constant miss rate over time and random cache replacement [2], and Shen et al. assumed a Bernoulli process and LRU cache replacement [18], [19]. The latter method was adapted to predict cache interference [12]. A precise prediction was shown to be useful in approximately solving the optimal co-scheduling problem [11]. However, none of these methods can bound the approximation error. Our earlier work gave the first precise methods for measuring the footprint [5], [23] and an iterative model for the circular effect of cache interference [23]. The linear-time algorithm in this paper computes the average rather than the full distribution and improves the measurement speed by nearly 40 times, yet maintains a similar accuracy in shared-cache locality prediction.

Online models: The miss rate curve has been used for memory partitioning to ensure fairness or maximize throughput in a parallel workload [27]. Similarly, reuse distance has been used for cache partitioning among data objects [14]. Recently, Zhuravlev et al. reviewed four models based on the miss rate and the reuse distance [29]. As online models, these techniques did not consider working-set metrics because of the cost. For example, Zhuravlev et al. considered a less accurate model from Chandra et al. because, for efficiency, it did not require all-window footprints [3]. Zhuravlev et al. showed that cache sharing is one factor in shared-resource contention, but not necessarily the major one [29]. Still, an accurate and fast solution may help to quantify the contribution of cache sharing to the overall interference.

Analytical models and streaming analysis: Counting the number of distinct data items has been considered as a
streaming analysis problem. Space-efficient solutions, using less than O(m) space, exist to measure the frequency moments $F_0$ (footprint), $F_1$ (total frequency), $F_2$, $F_\infty$ (most frequent item), and entropy [1], [8], [13]. Instead of counting the $F_0$ moment over the whole trace, we solve the problem of collecting the average $F_0$ for all execution windows and focus on reducing the time complexity from $O(n^2)$ to linear. Streaming solutions may be combined with our algorithms to further reduce their space requirement.

VII. SUMMARY

Complete characterization of the footprint requires measuring data access in all execution windows. In this paper, we have presented the average footprint as a metric of all-window footprint. The footprint function maps a time-window length to the average footprint. We have shown that the average footprint function is monotone and can be used in the composable model to rank cache interference in shared cache without having to test any parallel executions. We have presented a linear-time algorithm for accurately measuring the average footprint. The linear-time algorithm uses differential counting based on the forward and backward reuse time distances. When tested on SPEC CPU 2000 benchmarks, the average-footprint analysis is on average 38 times faster than the previous all-footprint analysis, yet it shows comparable accuracy in shared-cache locality prediction. The average-footprint analysis was efficient enough to measure the newer SPEC 2006 benchmarks, but the all-footprint analysis was not.

ACKNOWLEDGMENT

We would like to thank Tongxin Bai for providing histogram mapping libraries. The presentation has been improved by suggestions from Xipeng Shen and the systems group at the University of Rochester. Xiaoya Xiang and Bin Bao are supported by two IBM Center for Advanced Studies Fellowships. The research is also supported by the National Science Foundation (Contract No. CCF604, CCF , CNS ).

REFERENCES

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computing, pages 20 29, 1996.
[2] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 69 80,
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multiprocessor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages ,
[4] A. Chauhan and C.-Y. Shei. Static reuse distances for locality-based optimizations in MATLAB. In International Conference on Supercomputing, pages , 2010.
[5] C. Ding and T. Chilimbi. All-window profiling of concurrent executions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, poster paper.
[6] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi-threaded programs. Technical Report MSR-TR , Microsoft Research, August
[7] B. Falsafi and D. A. Wood. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 7():04 30, 1997.
[8] P. Flajolet and G. Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science, 1983.
[9] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(2):62 630, 1989.
[10] K. Z. Ibrahim and E. Strohmaier. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. Proceedings of the International Conference on Parallel Processing, 0: , 2010.
[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[12] Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction, pages , 2010.
[13] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 45 56,
[14] Q. Lu, J. Lin, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Soft-OLP: Improving hardware cache performance through software-controlled object-level partitioning. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[15] C.-K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, Illinois, June
[16] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78 7, 1970.
[17] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 53 64, 2010.
[18] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages ,
[19] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55 6,
[20] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 4(9): , 1992.
[21] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In International Conference on Supercomputing, pages 2, 2001.
[22] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4): , 1987.
[23] X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi. All-window profiling and composable models of cache sharing. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 9 02, 2011.
[24] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multicore cache management. In Proceedings of the EuroSys Conference,
[25] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the International Symposium on Memory Management, pages 9 00,
[26] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 3(6): 39, Aug
[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 77 88,
[28] S. Zhou. An efficient simulation algorithm for cache of random replacement policy. In Proceedings of the IFIP International Conference on Network and Parallel Computing, pages 44 54, 2010. Springer Lecture Notes in Computer Science No
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 29 42,
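Section VI notes that the average working-set size for a single window length, such as a scheduling quantum, can be computed in linear time. One straightforward way to do this, sketched below under the assumption that data identifiers are hashable (the function name is hypothetical), is a sliding window that keeps a per-datum count of the accesses currently inside the window. Each access is added once and removed once, so the pass takes O(n) expected time.

```python
from collections import defaultdict


def average_working_set(trace, w):
    """Mean number of distinct items over all windows of exactly w accesses."""
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    counts = defaultdict(int)  # item -> number of its accesses inside the window
    distinct = total = 0
    for i, item in enumerate(trace):
        counts[item] += 1
        if counts[item] == 1:
            distinct += 1
        if i >= w:  # drop the access that just left the window
            old = trace[i - w]
            counts[old] -= 1
            if counts[old] == 0:
                distinct -= 1
        if i >= w - 1:  # window trace[i-w+1 .. i] is complete
            total += distinct
    return total / (n - w + 1)


print(average_working_set(list("abac"), 3))  # 2.5
```

Covering every window length this way costs O(n) per length, O(n^2) in total, which is the cost the all-window average footprint is designed to avoid.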
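Section VI summarizes StatCache [2] as inferring the cache miss rate from the distribution of reuse times, assuming a constant miss rate and random replacement. The following is a minimal sketch of that style of inference under exactly those two assumptions; it is not Berg and Hagersten's implementation, and its details may differ from [2]. An access whose reuse time is t (accesses since the previous access to the same line) sees about R·t misses in between, and each miss evicts a given line with probability 1/L, so the access misses with probability 1 - (1 - 1/L)^(R·t). The miss ratio R must then satisfy a fixed-point equation. The function and parameter names are hypothetical.

```python
def statcache_miss_ratio(reuse_times, cold_misses, cache_lines, iters=200, tol=1e-12):
    """Solve R = (cold + sum_t (1 - (1 - 1/L)**(R*t))) / N for the miss ratio R.

    reuse_times: reuse time t of every non-cold access.
    cold_misses: number of first-time accesses (always misses).
    cache_lines: L, the number of lines in a randomly replaced cache.
    """
    if cache_lines < 1:
        raise ValueError("cache_lines must be at least 1")
    n_total = len(reuse_times) + cold_misses
    keep = 1.0 - 1.0 / cache_lines  # chance a given line survives one miss
    r = 1.0  # start high: the iteration then decreases monotonically
    for _ in range(iters):
        misses = cold_misses + sum(1.0 - keep ** (r * t) for t in reuse_times)
        r_next = misses / n_total
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r


reuse = list(range(1, 200))  # hypothetical reuse-time sample
for lines in (4, 16, 64):
    print(lines, round(statcache_miss_ratio(reuse, 50, lines), 3))
```

Starting from R = 1 matters. The right-hand side is increasing in R and never exceeds 1, so the iterates decrease monotonically to the largest fixed point instead of settling on the trivial R = 0 solution that exists when there are no cold misses.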
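Section VI mentions space-efficient streaming estimators for the $F_0$ moment, which is the footprint of the whole trace. For illustration only, the sketch below uses a k-minimum-values (KMV) estimator. This is a swapped-in technique, not the probabilistic counting of [8] or the methods of [1] and [13]. Each item is hashed to [0, 1), the k smallest distinct hash values are kept, and $F_0$ is estimated as (k - 1)/v, where v is the k-th smallest value. The function name and the default k = 256 are assumptions. Note that it estimates $F_0$ once over the whole stream, whereas the problem in this paper is the average $F_0$ over all windows.

```python
import hashlib
import heapq


def kmv_distinct_estimate(stream, k=256):
    """Estimate the number of distinct items using O(k) space."""
    heap = []         # negated hashes: a max-heap of the k smallest seen so far
    members = set()   # hashes currently in the heap, so repeats are skipped
    for item in stream:
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big") / 2**64  # pseudo-uniform in [0, 1)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:  # smaller than the current k-th smallest
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:  # fewer than k distinct hashes: the count is exact
        return len(heap)
    return (k - 1) / -heap[0]


print(round(kmv_distinct_estimate(list(range(20000)) * 2)))  # close to 20000
```

The relative standard error shrinks roughly as 1/sqrt(k), so k trades space for accuracy.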
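The summary states that the linear-time algorithm uses differential counting based on forward and backward reuse times. One way to realize that idea is sketched below. It is offered as an illustration, not necessarily the authors' exact formulation, and the function name is hypothetical. For a datum d and window length w, a run of g consecutive accesses that avoid d contains max(0, g - w + 1) windows of length w without d. The runs are the prefix before d's first access (forward time $f_d$, so g = $f_d$ - 1), the gaps between consecutive accesses (a reuse time t gives g = t - 1), and the suffix after the last access (backward time $l_d$ = n - last + 1, so g = $l_d$ - 1). With $r_t$ reuses of reuse time t and $x^+ = \max(0, x)$, summing over all m data gives

$$fp(w) = m - \frac{1}{n-w+1}\Big(\sum_{d}(f_d - w)^+ + \sum_{d}(l_d - w)^+ + \sum_{t}(t - w)^+ r_t\Big)$$

All three kinds of times go into one histogram, so a single suffix-sum pass yields fp(w) for every w in O(n) time.

```python
def average_footprints(trace):
    """Average footprint fp[w] over all windows of length w, for w = 1..n.
    fp[0] is unused. Runs in O(n) time and space."""
    n = len(trace)
    hist = [0] * (n + 2)  # hist[x]: how many forward, reuse, or backward times equal x
    last = {}             # item -> position of its latest access
    for pos, item in enumerate(trace, start=1):
        if item in last:
            hist[pos - last[item]] += 1  # reuse time t
        else:
            hist[pos] += 1               # forward (first-access) time f
        last[item] = pos
    for pos in last.values():
        hist[n - pos + 1] += 1           # backward time l = n - last + 1
    m = len(last)

    # absent(w) = sum over x > w of (x - w) * hist[x], via running suffix sums.
    fp = [0.0] * (n + 1)
    count = weighted = 0  # sums over x > w of hist[x] and of x * hist[x]
    for w in range(n, 0, -1):
        count += hist[w + 1]
        weighted += (w + 1) * hist[w + 1]
        absent = weighted - w * count    # (window, datum) pairs with the datum missing
        fp[w] = m - absent / (n - w + 1)
    return fp


print(average_footprints(["a", "b", "a", "c"])[1:])  # [1.0, 2.0, 2.5, 3.0]
```

On small traces the output can be checked against brute-force enumeration of all n(n+1)/2 windows, and the resulting values are nondecreasing in w, consistent with the monotonicity claimed in the summary.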
[TABLE II INDIVIDUAL STATISTICS OF THE 55 SPEC2000 AND SPEC2006 TEST PROGRAMS. Columns: benchmark, trace length, data size (64-byte lines), unmodified time (sec), average-footprint analysis time, footprint algorithm cost (×).]
[TABLE III COMPARISON OF THE MIN, MAX, AND AVERAGE SLOWDOWNS BY AVERAGE-FOOTPRINT ANALYSIS AND BY ALL-FOOTPRINT ANALYSIS. ON AVERAGE, AVERAGE-FOOTPRINT ANALYSIS IS 38 TIMES FASTER. Rows: min, max, mean over 5 SPEC2000 benchmarks. Columns: trace length, data size (64-byte lines), avg-fp slowdown (×), all-fp slowdown (×).]

[TABLE V SIMILARITY OF THE FOOTPRINT IN DIFFERENT EXECUTIONS OF THE 37 SPEC2K/2006 BENCHMARKS AS MEASURED BY THE MAX AND MIN MANHATTAN DISTANCE (max d_i, min d_i). Columns: benchmark, inputs, min n (×10^9), max n (×10^9), min m, max m, min d_i, max d_i.]
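The per-input distances d_i summarized in Table V can be computed from sampled footprint curves along the lines of Section G. The sketch below assumes each input's average footprint is sampled at the same W window lengths, and that d_i is the per-window relative difference from the mean curve averaged over those W lengths, which is the reading implied by "differs from the average by x% in each window size". The function names and sample curves are hypothetical.

```python
def mean_curve(curves):
    """Pointwise mean f_bar(w_j) of k footprint curves sampled at the same W lengths."""
    k = len(curves)
    return [sum(column) / k for column in zip(*curves)]


def manhattan_distances(curves):
    """d_i for each curve: |f_i(w_j) - f_bar(w_j)| / f_bar(w_j) averaged over
    the W window lengths (0.1 means 10%). Footprints of non-empty windows are
    at least 1, so f_bar is never zero for real footprint data."""
    f_bar = mean_curve(curves)
    W = len(f_bar)
    return [
        sum(abs(f - fb) / fb for f, fb in zip(curve, f_bar)) / W
        for curve in curves
    ]


# Hypothetical average-footprint curves of two inputs at two window lengths.
curves = [[90.0, 180.0], [110.0, 220.0]]
for i, d in enumerate(manhattan_distances(curves), start=1):
    print(f"input {i}: d_i = {d:.1%}")  # 10.0% for both inputs
```

The min d_i and max d_i columns of Table V would then be the minimum and maximum of this list over a program's inputs.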