Linear-time Modeling of Program Working Set in Shared Cache


Xiaoya Xiang, Bin Bao, Chen Ding, Yaoqing Gao
Computer Science Department, University of Rochester
IBM Toronto Software Lab

Abstract

Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all $O(n^2)$ windows in an $n$-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

Keywords: Footprint, Cache sharing

I. INTRODUCTION

During a program execution, its working set can be defined as the footprint, which is the volume of data accessed in an execution window. Since the footprint shows the active data usage, it has been used to model resource sharing among concurrent tasks and to improve throughput and enforce fairness, either in memory sharing among multi-programmed workloads or more recently in cache sharing among multicore workloads.

A trace of $n$ data accesses has $\binom{n}{2} = \frac{n(n-1)}{2}$ distinct windows and therefore $\frac{n(n-1)}{2}$ footprints. Early studies measured program footprints in the shared cache of time-sharing systems. Since applications interact between time quanta, it is sufficient to consider just the windows of a single length, the length of a scheduling quantum [20], [22]. On today's multicore systems, however, programs interact continuously.
A number of techniques were developed to estimate the footprint in all-length windows, but they did not guarantee the precision of the estimation [2], [3], [6], [8], [9], [2]. A recently published technique called all-window footprint analysis can measure all footprints in $O(CK \log M)$ time, where $CK$ is linear in the length of the trace and $M$ is the volume of data accessed in the trace [23]. For each window length, the analysis shows the maximum size, the minimum size, and the size distribution of footprints in all windows of this length [23]. The analysis is not fully accurate but guarantees a relative precision, e.g. 99%. We call the analysis all-footprint analysis, because it measures the size of every footprint.

In this paper, we present average-footprint analysis. For each window length, the analysis shows the average size of footprints in all windows of this length. While the analysis gives the accurate average, it does not measure the range or the distribution. However, a weaker analysis can often be done faster. Indeed, we show that the average footprint can be measured accurately in linear time $O(n)$ for a trace of length $n$, regardless of the data size.

The average footprint is a function mapping from the length of an execution window to the volume of its data access. Intuitively, the working set increases in larger execution windows. We prove that the average footprint is monotonically non-decreasing. The new analysis precisely quantifies the growth of the average footprint over time.

The previous all-footprint analysis was the key metric used in the composable models of cache sharing [3], [6], [2], [23]. For $P$ programs, there are $2^P$ co-run combinations. A composable model makes $2^P$ predictions using $P$ single-program runs rather than $2^P$ parallel runs. As an alternative to all-footprint analysis, the new average-footprint analysis can be used in the composable model to reduce the (footprint) measurement cost asymptotically.
To evaluate the speed and usefulness of the average-footprint analysis, we test it on the complete suites of SPEC 2000 and SPEC 2006 CPU benchmarks and compare the results with the fastest all-footprint analysis [23]. To measure the accuracy of cache sharing prediction, we rank the slowdowns in two- and three-program co-runs on a quad-core machine and compare the predicted ranking with exhaustive testing. Through experiments, we show that the average-footprint analysis can predict the effect of cache interference as accurately as the all-footprint analysis, yet at only a fraction of its cost. In fact, the cost of all-footprint analysis is too high for it to model SPEC 2006 benchmarks, which have up to 1.9 trillion accesses to up to 1GB of data. In comparison, the average-footprint analysis can model all SPEC 2006 benchmarks, finishing most of the programs within a few hours.

This study has two limitations. First, we are concerned with parallel workloads consisting of only sequential programs that do not share data. We do not consider parallel programs, although similar footprint metrics have been studied to model multi-threaded workloads [6], [7]. Second, the footprint results are input specific, so they are useful mostly in workload characterization, for example, finding the most and the least interference among a set of benchmark programs.
II. BACKGROUND ON OFFLINE CACHE MODELS

Offline cache models do not improve performance directly but can be used to understand the causes of interference and to predict its effect before running the programs (so they may be grouped to reduce interference). Offline analysis measures the effect of all data accesses, not just cache misses. It characterizes a single program unperturbed by other programs and the analysis itself. Such clean-room metrics avoid the chicken-egg problem when programs are analyzed together: the interference depends on the miss rate of co-running programs, but their miss rate in turn depends on the interference. Next we describe first the locality model of private cache and then the model for shared cache.

a) Reuse windows and the locality model: For each memory access, the temporal locality is determined by its reuse window, which includes all data accesses between this and the previous access to the same datum. Specifically, whether the access is a cache (capacity) miss depends on the reuse distance, the number of distinct data elements accessed in the reuse window. The relation between reuse distance and the miss rate has been well established [9], [6]. The capacity miss rate can be defined by a probability function involving the reuse distance and the cache size. Let the test program be A.

$$P(\text{capacity miss by } A \text{ alone}) = P(A\text{'s reuse distance} \ge \text{cache size})$$

b) Footprint windows and the cache sharing model: Offline cache sharing models were pioneered by Chandra et al. [3] and Suh et al. [2] for a group of independent programs and extended for multi-threaded code by Ding and Chilimbi [6], Schuff et al. [7], and Jiang et al. [2]. Let A, B be two programs that share the same cache but do not share data. The effect of B on the locality of A is

$$P(\text{capacity miss by } A \text{ when co-running with } B) = P((A\text{'s reuse distance} + B\text{'s footprint}) \ge \text{cache size})$$

Given an execution window in a sequential trace, the footprint is the number of distinct elements accessed in the window. The examples in Figure 1 illustrate the interaction between locality and footprint. In the first example, a reuse window in program A concurs with a time window in program B. The reuse distance of A is lengthened by the footprint of B's window. The second example uses two pairings of three traces to show that the shared cache miss rate depends also on the footprint, not just the miss rate of co-run threads.

An implication of the cache sharing model is that cache interference is asymmetric for programs with different locality and footprints. A program with large footprints and short reuse distances may disproportionally slow down other programs while experiencing little or no slowdown itself. This was observed in experiments [23], [24]. In one program pair, the first program shows a near 85% slowdown while the other program shows only a 5% slowdown.

Fig. 1. Example illustrations of cache sharing. (a) In shared cache, the reuse distance in thread A is lengthened by the footprint of thread B: the reuse window abcdefa in thread A has reuse distance rd = 5, the concurrent window kmmmnon in thread B has footprint ft = 4, and in the co-run of A and B the distance becomes rd + ft = 9. (b) Programs B and B2 have the same miss rate. However, A and B incur 50% more misses in shared cache than A and B2. The difference is caused not by data reuse but by data footprint. Traces: prog. A is xyyxxwxzzz, prog. B is abbaadaccc, prog. B2 is abccbcccdd; the B and A co-execution axbybyaxaxdwaxczczcz has 4 cache misses on a 3-element fully associative LRU cache, and the B2 and A co-execution axbycycxbxcwcxczdzdz has 2 cache misses on the same cache.

III. THE MEASUREMENT OF AVERAGE FOOTPRINT

A. Definitions

Let $W$ be the set of $\binom{n}{2}$ windows of a length-$n$ trace. Each window $w = \langle t, v \rangle$ has a length $t$ and a footprint $v$. Let $I(p)$ be a boolean function returning 1 when $p$ is true and 0 otherwise. The footprint function $fp(t)$ averages over all windows of length $t$:

$$fp(t) = \frac{\sum_{w_i \in W} v_i\, I(t_i = t)}{\sum_{w_i \in W} I(t_i = t)} = \frac{\sum_{w_i \in W} v_i\, I(t_i = t)}{n - t + 1}$$

For example, the trace abbb has 3 windows of length 2: ab, bb, and bb. The corresponding footprints are 2, 1, and 1, so $fp(2) = (2 + 1 + 1)/3 = 4/3$.

B. O(n) Algorithm

There is a linear-time algorithm that calculates the precise average footprint for all execution windows of a trace. Let $n$, $m$ be the length of the trace and the number of distinct data used in the trace. The algorithm first measures the following three quantities: (1) the distribution of the time distances of all data reuses ($n - m$ distances); (2) the first-access times of all distinct data ($m$ access times); (3) the last-access times of all distinct data (exact definition later, $m$ access times). The three quantities can be measured by a single pass over the trace using a hash table with one entry for each distinct datum. The cost is linear, $O(n)$ in time and $O(m)$ in space.
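The single-pass measurement described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' profiling tool: the function names (`profile_trace`, `average_footprint`, `average_footprint_bruteforce`) are hypothetical, the trace is assumed to fit in memory as a string or list, and positions are assumed to be 1-indexed. The second function evaluates the closed form that the rest of this section derives (Equation 7), and the third is a quadratic reference used only to check small traces.

```python
from collections import Counter


def profile_trace(trace):
    """One pass over the trace: reuse-time histogram r_t, first-access times f_i,
    and reverse last-access times l_i (positions are 1-indexed)."""
    first, last = {}, {}
    reuse_hist = Counter()
    for pos, datum in enumerate(trace, start=1):
        if datum in last:
            reuse_hist[pos - last[datum]] += 1  # reuse time distance t
        else:
            first[datum] = pos                  # first access time f_i
        last[datum] = pos
    n = len(trace)
    # l_i = n + 1 - x, i.e. the first access time in the reversed trace
    rev_last = {d: n + 1 - x for d, x in last.items()}
    return n, first, rev_last, reuse_hist


def average_footprint(profile, w):
    """Average footprint of all windows of length w, using only the profile."""
    n, first, rev_last, reuse_hist = profile
    m = len(first)
    # Each term counts windows of length w that do NOT contain a given datum.
    deficit = sum(f - w for f in first.values() if f > w)
    deficit += sum(l - w for l in rev_last.values() if l > w)
    deficit += sum((t - w) * c for t, c in reuse_hist.items() if t > w)
    return m - deficit / (n - w + 1)


def average_footprint_bruteforce(trace, w):
    """Reference implementation: enumerate every window of length w."""
    sizes = [len(set(trace[s:s + w])) for s in range(len(trace) - w + 1)]
    return sum(sizes) / len(sizes)


profile = profile_trace("abbb")
print(average_footprint(profile, 2))  # approximately 4/3, as in the abbb example
```

Once the profile is built, the trace is never read again; each window length is evaluated from the three summaries alone. The sketch loops over all histogram entries for every `w`, which is simpler than, but not as efficient as, an implementation that keeps prefix sums over the histogram.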
The three measures are the inputs to a formula $fp(w)$. For any window size $w$ ($0 < w \le n$), $fp(w)$ computes the average footprint for all windows of size $w$. In other words, the formula computes the average footprint for windows of all sizes without having to inspect the trace again. In the rest of the section, we derive the formula and discuss its complexity.

The main idea of the formula is differential counting, which counts the difference in the footprint between consecutive windows. For any window size $w$, we start with the footprint in the first window and then compute its increase or decrease as the window moves forward in the trace. The first-access times are sufficient to compute the footprint of the first window. The change in later windows depends on two metrics on each trace element $d_i$: the forward time distance $fwd(d_i)$ and the backward time distance $bwd(d_i)$. Let datum $x$ be accessed at $d_i$. Let the closest accesses of $x$ be $d_j$ before $d_i$ and $d_k$ after $d_i$. Then $fwd(d_i) = k - i$ and $bwd(d_i) = i - j$. The forward and backward time distances determine the change of footprint between consecutive windows. The relation is shown in Figure 2.

Fig. 2. An illustration of how the forward and backward (reuse) time distances influence the change in footprint between consecutive windows.

Let the footprint of a $w$-size window starting at $i$ be $fp(i)$. Each element $d_i$ in the trace affects the footprint of $w$ windows: $fp(i-w+1), fp(i-w+2), \ldots, fp(i)$. In differential counting, we consider only the effect of $d_i$ on two pairs of windows: the change from $fp(i-w)$ to $fp(i-w+1)$ when $d_i$ enters into its first window and the change from $fp(i)$ to $fp(i+1)$ when $d_i$ exits from its last window of influence. Figure 2 shows $d_i$ and the two pairs of windows where $d_i$ enters between the first pair and exits between the second pair.
When $d_i$ enters, it does not increase the footprint $fp(i-w)$ if the same datum was previously accessed within $fp(i-w+1)$, which means that its backward time distance is no greater than $w$ ($bwd(d_i) \le w$). This is the case illustrated in Figure 2. Otherwise, $d_i$ adds 1 to the footprint $fp(i-w)$. Similarly, when $d_i$ exits from $fp(i)$, the departure does not change $fp(i+1)$ if $fwd(d_i) \le w$; otherwise, it subtracts 1 from $fp(i+1)$, as in the case illustrated in Figure 2.

The footprint $fp(i+1)$ depends on three factors: the footprint $fp(i)$, the contribution of the entering $d_{i+w}$, and the detraction of the exiting $d_i$. The footprint of all windows is then computed by adding these differences. Next we formulate this computation. We use the following notations.

$n, m, w$: the length of the trace, the size of data, and the window size of interest.
$d_i$: the $i$th trace access.
$fp(i)$: the footprint of the window from $d_i$ to $d_{i+w-1}$ (including $d_i$ and $d_{i+w-1}$).
$bwd(d_i)$: the backward reuse time distance of $d_i$, $\infty$ if $d_i$ is the first access.
$fwd(d_i)$: the forward reuse time distance of $d_i$, $\infty$ if $d_i$ is the last access.
$I(p)$: a boolean function that returns 1 if $p$ is true and 0 otherwise. For example, $I(bwd(d_i) > w)$ gives the contribution by $d_i$, which is 1 if $bwd(d_i) > w$ and 0 otherwise. Similarly, $I(fwd(d_i) > w)$ gives the detraction of $d_i$, 1 if $fwd(d_i) > w$ and 0 otherwise.

The total size of the footprints in all windows of length $w$, when divided by the number of windows $n - w + 1$, is the average footprint, as shown next in Equation 1.

$$fp(w) = \frac{1}{n-w+1} \sum_{i=1}^{n-w+1} fp(i) \quad (1)$$

Since

$$fp(i+1) = fp(i) + I(bwd(d_{i+w}) > w) - I(fwd(d_i) > w) \quad (2)$$

expanding Equation 1 using Equation 2, we have three components in the average footprint:

$$fp(w) = fp(1) + \frac{1}{n-w+1} \left( \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) \right) \quad (3)$$

Next we compute each component separately.
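Differential counting as stated in Equation 2 can be sketched as follows. The sketch is illustrative only: the helper names (`neighbor_distances`, `window_footprints`) are hypothetical, indices are 0-based (so the window starting at `i` covers positions `i` to `i + w - 1`), and the whole trace is assumed to be held in memory.

```python
INF = float("inf")


def neighbor_distances(trace):
    """Backward and forward reuse time distances of every access.
    A first access has bwd = INF; a last access has fwd = INF."""
    n = len(trace)
    bwd, fwd = [INF] * n, [INF] * n
    last = {}
    for i, datum in enumerate(trace):
        if datum in last:
            # The same gap is the backward distance here and the
            # forward distance at the previous access.
            bwd[i] = fwd[last[datum]] = i - last[datum]
        last[datum] = i
    return bwd, fwd


def window_footprints(trace, w):
    """Footprints of all windows of length w, computed incrementally."""
    n = len(trace)
    bwd, fwd = neighbor_distances(trace)
    # First window: one unit per first access inside it.
    fp = sum(1 for i in range(w) if bwd[i] == INF)
    sizes = [fp]
    for i in range(n - w):
        # Moving the window start from i to i + 1:
        # d[i + w] enters, d[i] exits.
        fp += int(bwd[i + w] > w) - int(fwd[i] > w)
        sizes.append(fp)
    return sizes


print(window_footprints("abbb", 2))  # the three windows ab, bb, bb
```

Averaging the returned list gives the average footprint for one window length. The closed form derived next avoids walking the trace once per window length.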
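The footprint of the first window of length $w$ is

$$fp(1) = \sum_{i=1}^{w} I(bwd(d_i) = \infty) \quad (4)$$

In the next component, we split the backward time distances into two groups: finite and infinite distances. The summation over the finite distances can be changed to run from $1$ to $n$ instead of from $w+1$ to $n$, because a finite backward distance at a position $i \le w$ is smaller than $w$.

$$\begin{aligned}
\sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w)
&= \sum_{i=w+1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty) \\
&= \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty)
\end{aligned} \quad (5)$$

Similarly, we decompose and simplify the forward distances: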
$$\sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) = \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) + \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) \quad (6)$$

Combining Equations 4, 5, and 6, we can now expand Equation 3. Instead of using individual accesses, we now use the three inputs, defined as follows:

$f_i$: the first access time of the $i$th datum.
$l_i$: the reverse last access time of the $i$th datum. If the last access is at position $x$, $l_i = n + 1 - x$, that is, the first access time in the reverse trace.
$r_t$: the number of accesses with a reuse time distance $t$.

$$\begin{aligned}
fp(w) &= \sum_{i=1}^{n} I(bwd(d_i) = \infty) + \frac{1}{n-w+1} \Big( \sum_{i=w+1}^{n} (w-i)\, I(bwd(d_i) = \infty) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) \\
&\qquad + \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) - \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) \Big) \\
&= m + \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (w-f_i)\, I(f_i > w) + \sum_{i=1}^{m} (w-l_i)\, I(l_i > w) + \sum_{t=1}^{n-1} (w-t)\, I(t > w)\, r_t \Big) \\
&= m - \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (f_i-w)\, I(f_i > w) + \sum_{i=1}^{m} (l_i-w)\, I(l_i > w) + \sum_{t=w+1}^{n-1} (t-w)\, r_t \Big)
\end{aligned} \quad (7)$$

The formula of Equation 7 passes the sanity check that the average footprint $fp(w)$ is at most the data size $m$, and the footprint of the whole trace ($w = n$) is $m$. Fixing the window length $w$ and ignoring the effect of first and last accesses, we see that the footprint decreases if more reuse time distances ($r_t$) have larger values ($t$). This suggests that improving locality reduces the average footprint. For example, if we double the length of a trace by repeating each element twice, the length of the long time distances would double, and the average footprint would drop.

For each window length $w$, Equation 7 can be computed in time $O(w)$. If we limit the analysis to window sizes on a logarithmic scale, the formula can be represented and evaluated in $O(\log w)$ time.

C. Monotonicity

Theorem 3.1: The average footprint $fp(w)$ is non-decreasing.

Proof: Let $w_i^k$ denote the $i$th window whose size is $k$, and let $f(w_i^k)$ denote the footprint of the $i$th window whose size is $k$. We prove that $\forall k$, $0 < k < n$, $fp(k+1) \ge fp(k)$.
First, $\forall i$, $0 < i \le n-k$, the following holds because $w_i^k$ and $w_{i+1}^k$ are both contained in $w_i^{k+1}$:

$$f(w_i^{k+1}) \ge f(w_i^k), \qquad f(w_i^{k+1}) \ge f(w_{i+1}^k)$$

In addition, we have $\forall k$, $0 < k \le n$, $\exists j$, $0 < j \le n-k+1$, such that $f(w_j^k) \le fp(k)$. Now then

$$\begin{aligned}
fp(k+1) &= \frac{1}{n-k} \sum_{i=1}^{n-k} f(w_i^{k+1}) \\
&= \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k+1}) + \sum_{i=j}^{n-k} f(w_i^{k+1}) \Big] \\
&\ge \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j}^{n-k} f(w_{i+1}^{k}) \Big] \\
&= \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j+1}^{n-k+1} f(w_i^{k}) \Big] \\
&= \frac{1}{n-k} \Big[ \sum_{i=1}^{n-k+1} f(w_i^{k}) - f(w_j^{k}) \Big] \\
&= \frac{1}{n-k} \big[ (n-k+1)\, fp(k) - f(w_j^k) \big] \\
&= fp(k) + \frac{1}{n-k} \big[ fp(k) - f(w_j^k) \big] \\
&\ge fp(k)
\end{aligned}$$

IV. AVERAGE FOOTPRINT IN THE COMPOSABLE MODEL

Our previous work used all-footprint analysis in the composable model to predict cache interference [23]. In the composable model, when multiple programs are run together, each reuse distance in a program is lengthened by the aggregate footprint of all peer programs over the same time window. Suppose there are $n$ programs $t_1, t_2, \ldots, t_n$ running on a shared cache; the miss rate is computed by

$$P(\text{capacity miss by } t_i \text{ running with } t_j,\ j = 1, \ldots, n,\ j \ne i) = P\Big( \big( t_i\text{'s reuse distance} + \sum_{j \ne i} t_j\text{'s footprint} \big) \ge \text{cache size} \Big)$$

Suppose the distribution of program $t_i$'s reuse distance is $D_{rd}(t_i)$, and the distribution of program $t_i$'s footprint of
window size $w$ is $D_{fp}(t_i, w)$. The first distribution is defined as

$$D_{rd}(t_i) = \Big\{ \langle x_{k_i}, p_{k_i} \rangle \ \Big|\ \sum_{k_i} p_{k_i} = 1 \Big\}$$

where $\langle x_{k_i}, p_{k_i} \rangle$ means that the probability that the reuse distance equals $x_{k_i}$ is $p_{k_i}$. Similarly, we define

$$D_{fp}(t_i, w) = \Big\{ \langle y^w_{k_i}, q^w_{k_i} \rangle \ \Big|\ \sum_{k_i} q^w_{k_i} = 1 \Big\}$$

Given a window size $w$, we use $\langle y^w_{k_i}, q^w_{k_i} \rangle$ to mean that the probability that the footprint equals $y^w_{k_i}$ is $q^w_{k_i}$. Consider a 2-program co-run involving $t_1$ and $t_2$. The capacity miss rate of $t_1$ is calculated by Equation 8.

$$mr(t_1) = \sum_{k_1} \sum_{k_2} p_{k_1}\, q^{w(x_{k_1})}_{k_2}\, I\big( x_{k_1} + y^{w(x_{k_1})}_{k_2} \ge C \big) \quad (8)$$

where $I$ is the indicator function defined in Section III, $C$ is the cache size, and $w(x_k)$ is the size of the reuse window that contains the reuse distance $x_k$. This is the equation employed by all-footprint based modeling [23].

To use average-footprint analysis instead, we define the average footprint of a window size $w$ for program $t_i$ as $F(t_i, w) = f_i^w$. Equation 8 can be simplified to Equation 9.

$$mr(t_1) = \sum_{k_1} p_{k_1}\, I\big( x_{k_1} + f_2^{w(x_{k_1})} \ge C \big) \quad (9)$$

The estimation of the execution time from the miss rate is the same as in [23]. The only difference is that the previous model uses all-footprint analysis and Equation 8, and the new model uses average-footprint analysis and the simpler Equation 9.

V. EVALUATION

A. Experimental Setup

We have implemented the average-footprint analysis algorithm in a profiling tool and tested 26 SPEC2K benchmarks, 12 integer and 14 floating-point, and 29 SPEC2006 benchmarks, 12 integer and 17 floating-point. All benchmarks are instrumented by Pin [5] and profiled on a machine with an Intel Core i5-660 processor and 4GB physical memory. The machine is set up with Fedora 3 and GCC. The two-program co-run results for SPEC 2000 are collected on an Intel Core 2 Duo machine with two 2.0GHz cores sharing 2MB L2 cache and 2GB memory. In order to measure 3-program co-runs, we use an Intel quad-core machine, with four 2.27GHz cores sharing 8MB L3 cache and 8GB memory. Except in Section V-G when we examine the effect of input, we use the reference input in the test.
Some programs, especially in SPEC 2006, have multiple reference inputs. We use the first one tested by the auto-runner. In performance comparisons, the base program run time is the one measured without Pin instrumentation or any other analysis.

The length of SPEC 2000 traces ranges from 4 billion in gcc to 425 billion in mgrid. The amount of data ranges from 3 thousand 64-byte cache blocks (MB) in eon to 3.2 million cache blocks (256MB) in gcc. The SPEC 2006 traces on average are 10 times as long as SPEC 2000 traces and have 5 times as many cache blocks. The trace of bwaves is the longest with 1.9 trillion data accesses and has the most data, 928MB. The individual statistics of the 55 programs are listed in Table II.

To evaluate cache-sharing predictions, we run two experiments:

2-program co-runs. We predict all 2-program co-runs and compare the predicted ranking with that of the previous work, using the 15 SPEC 2000 benchmarks used in the previous work [23].

3-program co-runs. We started with the 10 representative benchmarks in SPEC2006 as selected by Zhuravlev et al. [29]. Reuse-distance analysis was too slow to measure 2 programs. We evaluate the prediction for all program triples of the remaining 8 programs.

In both tests, we also compare with a simple prediction method based on miss rates (by ranking the total miss rate of the programs in the co-run group) [23].

B. Efficiency of Average-footprint Analysis

Table I summarizes the analysis cost for the two benchmark suites, and for each suite, the average for integer and for floating-point programs. It divides the 55 tests into four groups: 12 SPEC 2000 integer programs, 14 SPEC 2000 floating-point programs, 12 SPEC 2006 integer programs, and 17 SPEC 2006 floating-point programs. The result of each group is summarized in three rows and three columns. The columns show the trace length, the data size, and the slowdown ratio of the profiling time to the unmodified run time.
The rows show the minimum, maximum, and the average slowdown factors for all benchmarks of the group. The minimum slowdowns in the four benchmark groups are all below 10. The maximum slowdowns are 40, 32, 4, and 74. The average slowdowns are between 2 in SPEC 2006 integer tests and 29 in SPEC 2000 integer tests. On average across all four groups, average-footprint analysis takes no more than 30 times the original execution time.

The individual results of the 55 programs are shown in Table II. Compared to the summary table, the individual-result table has two additional columns, which show the unmodified execution time and the time of average-footprint analysis. The unmodified time measures the execution of the original program without any instrumentation or analysis. On average, an unmodified SPEC 2000 program takes less than 3 minutes, and an unmodified SPEC 2006 program takes close to 10 minutes. Average-footprint analysis takes 3 to 73 minutes for SPEC 2000 programs and 10 minutes (gcc) to 10 hours (calculix) for SPEC 2006 programs.

C. Comparison with All-footprint Analysis

All-footprint analysis can analyze SPEC 2000 programs but not SPEC 2006 programs. We compare average- and all-footprint analysis on SPEC 2000 programs in Table III. SPEC 2000 has 26 programs in total. The paper on all-footprint analysis reported results for 15 of the programs [23]. The table summarizes the cost of the two analyses in these 15 tests in the last two columns. The slowdowns by average-footprint analysis are between 8.8 and 40. The slowdowns by all-footprint analysis are between 248 and 360. The average slowdown is 40 for average-footprint analysis and about 500
for all-footprint analysis. In other words, on average for these 15 programs, average-footprint analysis is 38 times faster than all-footprint analysis.

All-footprint analysis takes too long for SPEC 2006 programs. For example, it takes average-footprint analysis 10 hours to profile calculix. Being 38 times slower, all-footprint analysis would take more than two weeks to measure the footprint distribution.

TABLE I. The min, max, and average costs (slowdowns) of average-footprint analysis for 55 SPEC 2000 and SPEC 2006 benchmark programs. Columns: benchmarks (SPEC2000 INT, SPEC2000 FP, SPEC2006 INT, SPEC2006 FP programs), stats (min, max, mean), trace length, data size (64B lines), avg-fp slowdown (x).

D. Two-Program Co-run Ranking

The prior work showed co-run ranking results for 15 SPEC 2000 programs based on all-footprint analysis and compared them with miss-rate based ranking and exhaustive testing [23]. We now show the ranking results using average-footprint analysis and compare them with the three previous ranking methods.

We show the prediction results in a 2D plot. The x-axis is the rank of program co-run groups. In this test, the rank ranges from 1 (the least interfering pair) to 105 (the most interfering pair). The y-axis shows the interference, measured by the quadratic mean of the slowdowns of programs in the co-run group. The slowdown of a co-run program is the ratio of its co-run time and the time running alone on the same machine (cache). For two programs with slowdowns $s_1$, $s_2$, we have $y = \sqrt{\frac{s_1^2 + s_2^2}{2}}$.

The three graphs in Figure 3 show the plots for the predictions based on miss rate, all-footprint analysis, and average-footprint analysis. In each plot, the accurate result from exhaustive testing is shown by a monotonically increasing red line as a reference.
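The interference measure and the ranking used in these plots can be sketched as below. This is a minimal illustration, not the evaluation scripts used in the paper: the function names (`interference`, `rank_groups`) and the sample slowdown values are hypothetical and chosen only to show the computation.

```python
import math


def interference(slowdowns):
    """Quadratic mean of the slowdowns of the programs in one co-run group."""
    return math.sqrt(sum(s * s for s in slowdowns) / len(slowdowns))


def rank_groups(group_slowdowns):
    """Rank co-run groups from least (rank 1) to most interference.
    `group_slowdowns` maps a group name to the slowdowns of its programs."""
    ordered = sorted(group_slowdowns,
                     key=lambda g: interference(group_slowdowns[g]))
    return {group: rank for rank, group in enumerate(ordered, start=1)}


# Made-up slowdowns for three hypothetical program pairs.
measured = {
    ("p1", "p2"): [1.05, 1.10],
    ("p1", "p3"): [1.85, 1.05],
    ("p2", "p3"): [1.20, 1.25],
}
print(rank_groups(measured))
```

The quadratic mean weights a large slowdown of one program more heavily than the arithmetic mean would, so an asymmetric pair such as the second one above ranks as more interfering than a pair with two moderate slowdowns.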
The simple miss-rate based prediction does not show an increasing trend, suggesting no correlation between the prediction and the actual interference. The two footprint-based predictions show significant correlation. Programs predicted to have a high interference tend to actually have a high interference.

Average-footprint analysis ranks several program pairs better than all-footprint analysis. Consider the pair with the highest interference, (art, mcf), with a slowdown of 2. The pair is ranked 23 by miss rate, 99 by all-footprint analysis, and 105 by average-footprint analysis. The average-footprint rank is precisely correct. All-footprint ranking has a significant misprediction for the program pair (gcc, art). The two programs slow each other down by 1.6 times. The pair should be ranked 97 but is ranked 44 by all-footprint analysis, which is worse than the miss-rate rank of 70. The rank by average-footprint analysis is the closest at 86.

E. Three-Program Co-run Ranking

Evaluating larger group co-runs is difficult because the number of tests increases exponentially with the size of the co-run group. To test all 3-program co-runs in SPEC 2006 benchmarks, we would have to run $\binom{29}{3} = 3654$ tests. Even if we ran all the tests, it would have been impossible to show the results clearly. Fortunately, Zhuravlev et al. have analyzed the benchmark set based on the cache miss rates and access rates and identified 10 representatives [29]. We had to narrow down further because the reuse-distance analysis could finish only for 8 out of the 10 representatives: 403.gcc, 416.gamess, 429.mcf, 444.namd, 445.gobmk, 450.soplex, 453.povray, and 470.lbm. There are 56 different 3-program groups from these 8 benchmarks. We show the prediction results in Figure 4.

The results for 3-program co-runs of SPEC 2006 programs are similar to those of 2-program co-runs of SPEC 2000 programs.
As before, the miss-rate based prediction does not show a detectable correlation while the average-footprint analysis shows a clear correlation with the actual interference. The maximal slowdown increases from 2.0 in 2-program co-runs to 3.3 in 3-program co-runs, confirming the expectation that the interference becomes worse as the cache is shared by more programs. Exhaustive testing is also increasingly infeasible. For both reasons, the composable model is more valuable, and so is the higher efficiency of the average-footprint analysis.

F. Rank and Performance Closeness

To quantify the difference between the predicted ranking and the accurate ranking, we define two metrics: the rank closeness and the performance closeness. The rank closeness shows on average how the predicted rank of a co-run group differs from the actual rank. We number $n$ co-run groups by their accurate rank $i$. Let $pred(i)$ be the predicted rank for group $i$. The rank closeness is defined as

$$\text{rank closeness} = \frac{\sum_{i=1}^{n} |pred(i) - i|}{n}$$
The formula is the Manhattan distance between two vectors $\langle pred(1), pred(2), \ldots, pred(n) \rangle$ and $\langle 1, 2, \ldots, n \rangle$, divided by $n$. The worst possible ranking has a rank-closeness score of $n/2$ if $n$ is even or $(n^2 - 1)/(2n)$ otherwise.

Next we quantify the error in terms of the mispredicted slowdown. Let $f(i)$ be the slowdown of the co-run group with the accurate rank $i$, and $f(pred(i))$ be the slowdown of the co-run group with the predicted rank $i$. The difference $|f(pred(i)) - f(i)|$ gives the misprediction. The performance closeness is the average misprediction for all groups:

$$\text{performance closeness} = \frac{\sum_{i=1}^{n} |f(pred(i)) - f(i)|}{n}$$

The two metrics are shown in Table IV. On average for 2-program co-runs, the miss-rate rank errs by 35 positions, while the footprint-based ranks err by 4 and 5 positions. For 3-program co-runs, the miss-rate rank errs by 9 positions, while the average-footprint rank errs by 6. In terms of performance, the miss-rate based ranking mispredicts twice as badly as the footprint-based ranking.

Fig. 3. Evaluation of 2-program co-run predictions for 15 SPEC2000 benchmark programs. The prediction quality of average-footprint analysis is similar to that of all-footprint analysis. (Three panels compare miss rate, all footprint, and average footprint against exhaustive testing; x-axis: ranked program pairs, from least interference to most interference; y-axis: slowdown.)

Fig. 4. Comparison of 3-program co-run predictions for 8 SPEC2006 benchmarks. All-footprint analysis cannot model these programs because of its high cost. (Two panels compare miss rate and average footprint against exhaustive testing; x-axis: ranked program triples, from least interference to most interference; y-axis: slowdown.)

In search of a closeness metric, we also measured the Levenshtein distance.
For two permutations of a set of numbers, the Levenshtein distance measures the number of edits needed to convert one to the other. For the 2-program co-run test, the distance is 103 for miss rate, 97 for all-footprint, and 96 for average-footprint. For the 3-program co-run test, the distance is 54 for miss rate and 48 for average-footprint. Levenshtein distance is not a good metric here since it does not distinguish a ranking that shows no correlation from rankings that do.
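The two closeness metrics defined in this section are simple to compute once the groups are numbered by their accurate rank. The following is a minimal sketch under that assumption; the function names and the example inputs are hypothetical and are not taken from the paper's data.

```python
def rank_closeness(pred):
    """Average absolute difference between predicted and accurate rank.
    pred[i - 1] is the predicted rank of the group whose accurate rank is i."""
    n = len(pred)
    return sum(abs(p - i) for i, p in enumerate(pred, start=1)) / n


def performance_closeness(pred, slowdown):
    """Average misprediction in slowdown.
    slowdown[i - 1] is the measured slowdown of the group with accurate rank i,
    so slowdown[p - 1] is the slowdown of the group that truly holds rank p."""
    n = len(pred)
    return sum(abs(slowdown[p - 1] - slowdown[i - 1])
               for i, p in enumerate(pred, start=1)) / n


# A perfect prediction and a fully reversed one for four groups.
print(rank_closeness([1, 2, 3, 4]))  # 0.0
print(rank_closeness([4, 3, 2, 1]))  # 2.0, i.e. n/2 for even n
```

Unlike an edit distance, both metrics grow with how far each group is displaced, which is why they separate a correlated ranking from an uncorrelated one.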
TABLE IV. Comparison of different ranking strategies. Columns: ranking strategy, perf closeness, rank closeness. Rows: miss rate, all-footprint, and avg-footprint for the 2-program co-run over 15 SPEC2000 benchmarks; miss rate and avg-footprint for the 3-program co-run over 10 SPEC2006 benchmarks.

G. The Effect of Input on Average Footprint

The footprint of a program execution is affected by the program input just as the length of the execution is affected by the input. An important question for profiling-based techniques is how much the footprints in training runs may differ from those in test runs. In this section, we give a preliminary measure of this difference.

Given a set of $k$ executions of the same program, we quantify the variation between the $k$ footprint functions $f_i(w)$ as follows. First, we compute the average of the average footprints: $\bar{f}(w) = \frac{1}{k} \sum_{i=1}^{k} f_i(w)$. Then we compute the Manhattan distance between the $i$th execution and the average as:

$$d_i = \frac{1}{|W|} \sum_{j=1}^{|W|} \frac{|f_i(w_j) - \bar{f}(w_j)|}{\bar{f}(w_j)}$$

where $|W|$ is the number of different window lengths. A Manhattan distance of x% means that on average, the footprint function of input $i$ differs from the average by x% in each window size.

Table V shows SPEC2000 and SPEC2006 programs, the number of inputs (provided by the benchmark suite and tested in our experiments), the range of trace lengths and data sizes in these inputs, and the smallest and largest Manhattan distances as we just defined. The majority of programs, 20 out of 37, see no more than 30% difference between footprints of different inputs. The minimal difference is less than 20% in all but 5 programs. Note that the effect of the input may be predicted using model fitting based on the input characteristics [26]. This is outside the scope of this paper.

VI. RELATED WORK

Locality models: Locality in private cache can be modeled by reuse distance, which can be measured with a guaranteed precision in time O(n log 2 m), where n is the length of the trace and m is the size of data [26].
Reuse distance has found many uses in workload characterization and program optimization [26]. There are a number of recent developments. Chauhan and Shei gave a method for static analysis of locality in MATLAB code [4]. Unlike profiling, whose results are usually input specific, static analysis can identify and model the effect of program parameters. Most previous models targeted program analysis. Ibrahim and Strohmaier used synthetic probes to emulate the locality of an application for efficient machine characterization [0]. Zhou studied the random cache replacement policy and gave a one-pass deterministic trace-analysis algorithm to compute the average miss rate (instead of simulating many times and taking the average) [28]. Finally, Schuff et al. defined multicore reuse distance analysis and improved its efficiency through sampling and parallelization [7]. The sampling was based on a method developed earlier by Zhong and Chang [25]. These techniques are concerned only with reuse windows and cannot measure the footprint in all execution windows, which is the problem addressed in this paper.

Offline cache sharing models: The average working-set size in single-length execution windows, such as a scheduling quantum, can be computed in linear time. It has been used in studying multiprogrammed systems [7], [22]. In a parallel environment such as today's multicore processors, programs interact constantly. The interference in all-length windows has been considered for memory [2] and for cache [3]. Both used a recursive equation involving the working set and the miss rate: as a window of size w is extended to w + 1, the change in the working set depends on whether the new access is a miss. Suh et al. assumed linear function growth when window sizes were close [2]. Chandra et al. computed the recursive relation bottom up [3]. The same problem has been solved using statistical inference. Two techniques by Berg and Hagersten (known as StatCache) [2] and by Shen et al.
[9] were used to infer the cache miss rate from the distribution of reuse times. Berg and Hagersten assumed a constant miss rate over time and random cache replacement [2], and Shen et al. assumed a Bernoulli process and LRU cache replacement [8], [9]. The latter method was adapted to predict cache interference [2]. Precise prediction was shown to be useful in approximately solving the optimal co-scheduling problem []. However, none of these methods can bound the approximation error. Our earlier work gave the first precise methods for measuring the footprint [5], [23] and an iterative model for the circular effect of cache interference [23]. The linear-time algorithm in this paper computes the average rather than the full distribution and improves the measurement speed by nearly 40 times, yet maintains a similar accuracy in shared-cache locality prediction.

Online models: The miss rate curve has been used for memory partitioning to ensure fairness or maximize throughput in a parallel workload [27]. Similarly, reuse distance has been used for cache partitioning among data objects [4]. Recently, Zhuravlev et al. reviewed four models based on the miss rate and the reuse distance [29]. As online models, these techniques did not consider the working set metrics because of the cost. For example, Zhuravlev et al. considered a less accurate model from Chandra et al. because, for efficiency, it did not require all-window footprints [3]. Zhuravlev et al. showed that cache sharing is one of the factors but not necessarily the major factor [29]. Still, an accurate and fast solution may help to quantify the contribution of cache sharing to the overall interference.

Analytical models and streaming analysis: Counting the number of distinct data items has been considered as a
streaming analysis problem. Space-efficient (less than O(m)) solutions exist to measure the frequency moments $F_0$ (footprint), $F_1$ (total frequency), $F_2$, $F_\infty$ (most frequent item), and entropy [], [8], [3]. Instead of counting the $F_0$ moment over the whole trace, we solve the problem of collecting the average $F_0$ for all execution windows and focus on reducing the time complexity from $O(n^2)$ to linear. Streaming solutions may be combined with our algorithms to further reduce their space requirement.

VII. SUMMARY

Complete characterization of the footprint requires measuring data access in all execution windows. In this paper, we have presented the average footprint as a metric of the all-window footprint. The footprint function maps from time to average footprint. We have shown that the average footprint function is monotone and can be used in the composable model to rank cache interference in shared cache without having to test any parallel executions. We have presented a linear-time algorithm for accurately measuring the average footprint. The linear-time algorithm uses differential counting based on the forward and backward reuse time distance. When tested on SPEC CPU 2000 benchmarks, the average-footprint analysis is on average 38 times faster than the previous all-footprint analysis, yet it shows comparable accuracy in shared-cache locality prediction. The average-footprint analysis was efficient enough to measure the newer SPEC 2006 benchmarks, but the all-footprint analysis was not.

ACKNOWLEDGMENT

We would like to thank Tongxin Bai for providing histogram mapping libraries. The presentation has been improved by suggestions from Xipeng Shen and the systems group at the University of Rochester. Xiaoya Xiang and Bin Bao are supported by two IBM Center for Advanced Studies Fellowships. The research is also supported by the National Science Foundation (Contract No. CCF604, CCF , CNS ).

REFERENCES

[] N. Alon, Y. Matias, and M. Szegedy.
The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computing, pages 20 29, 996.
[2] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 69 80,
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multiprocessor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages ,
[4] A. Chauhan and C.Y. Shei. Static reuse distances for locality-based optimizations in MATLAB. In International Conference on Supercomputing, pages , 200.
[5] C. Ding and T. Chilimbi. All-window profiling of concurrent executions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, poster paper.
[6] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi-threaded programs. Technical Report MSRTR , Microsoft Research, August
[7] B. Falsafi and D. A. Wood. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 7():04 30, 997.
[8] P. Flajolet and G. Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science, 983.
[9] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(2):62 630, 989.
[0] K. Z. Ibrahim and E. Strohmaier. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. Proceedings of the International Conference on Parallel Processing, 0: , 200.
[] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[2] Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction, pages , 200.
[3] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 45 56,
[4] Q. Lu, J. Lin, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Soft-OLP: Improving hardware cache performance through software-controlled object-level partitioning. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[5] C.K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, Illinois, June
[6] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM System Journal, 9(2):78 7, 970.
[7] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 53 64, 200.
[8] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages ,
[9] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55 6,
[20] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 4(9): , 992.
[2] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In International Conference on Supercomputing, pages 2, 200.
[22] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4): , 987.
[23] X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi. All-window profiling and composable models of cache sharing. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 9 02, 20.
[24] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multicore cache management. In Proceedings of the EuroSys Conference,
[25] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the International Symposium on Memory Management, pages 9 00,
[26] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 3(6): 39, Aug
[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 77 88,
[28] S. Zhou. An efficient simulation algorithm for cache of random replacement policy. In Proceedings of the IFIP International Conference on Network and Parallel Computing, pages 44 54, 200. Springer Lecture Notes in Computer Science No
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 29 42,
TABLE II. Individual statistics of the 55 SPEC2000 and SPEC2006 test programs. Columns: benchmark, trace length, data size (64B lines), unmodified time (sec), avg-FP analysis time, FP alg cost (x).
SPEC2000 rows: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.
SPEC2006 rows: perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk, bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3.
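To make the quantity behind the "avg-FP analysis" column concrete, the sketch below computes the average footprint for one window length directly from its definition: the mean number of distinct data items over all windows of that length. This is a brute-force illustration that costs O(n) per window length, not the paper's linear-time algorithm; the names `average_footprint` and `trace` are hypothetical.

```python
def average_footprint(trace, w):
    """Mean number of distinct elements over all length-w windows of trace."""
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    # Occurrence counts of each datum inside the current window.
    counts = {}
    for x in trace[:w]:
        counts[x] = counts.get(x, 0) + 1
    total = len(counts)  # footprint of the first window
    # Slide the window one access at a time.
    for start in range(1, n - w + 1):
        leaving = trace[start - 1]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        entering = trace[start + w - 1]
        counts[entering] = counts.get(entering, 0) + 1
        total += len(counts)
    return total / (n - w + 1)


# A tiny made-up trace: the average footprint for each window length.
trace = list("aabb")
print([average_footprint(trace, w) for w in range(1, len(trace) + 1)])
```

Calling the function for every window length costs quadratic time overall, which is the cost the linear-time analysis is designed to avoid.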
TABLE III. Comparison of the min, max, and average slowdowns by average-footprint analysis and by all-footprint analysis. On average, average-footprint analysis is 38 times faster. Columns: trace length, data size (64B lines), avg-FP slowdown (x), all-FP slowdown (x). Rows: min, max, mean over the SPEC2000 benchmarks measured.

TABLE V. Similarity of the footprint in different executions of the 37 SPEC2K/2006 benchmarks as measured by the max and min Manhattan distance (max $d_i$, min $d_i$). Columns: benchmark, inputs, min n and max n (trace length, in units of $10^9$), min m and max m (data size), min $d_i$, max $d_i$.
Rows: crafty, ammp, gap, gzip, mesa, equake, gcc, art, parser, bzip2, vpr, mcf, twolf, namd, gromacs, lbm, GemsFDTD, povray, calculix, sphinx3, h264ref, cactusADM, zeusmp, bwaves, mcf, sjeng, bzip2, perlbench, leslie3d, gcc, milc, soplex, hmmer, libquantum, tonto, astar, gamess, omnetpp, gobmk.
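The per-input distances summarized in Table V can be sketched as follows. The code assumes that each execution's average footprint has been sampled at the same W window lengths, and that the distance is the mean relative deviation from the cross-input average, which is what the "differs by x% in each window size" reading implies. The function name `manhattan_distances` and the sample numbers are hypothetical.

```python
def manhattan_distances(footprints):
    """footprints[i][j] is the average footprint of execution i at the
    j-th window length. Returns one distance d_i per execution."""
    k = len(footprints)
    W = len(footprints[0])
    if any(len(f) != W for f in footprints):
        raise ValueError("all executions must use the same window lengths")
    # Average of the average footprints at each window length.
    mean = [sum(f[j] for f in footprints) / k for j in range(W)]
    # Assumption: deviations are normalized by the mean, so that d_i
    # reads as a fraction (multiply by 100 for a percentage).
    return [sum(abs(f[j] - mean[j]) / mean[j] for j in range(W)) / W
            for f in footprints]


# Two made-up executions sampled at two window lengths.
d = manhattan_distances([[1.0, 2.0], [3.0, 6.0]])
print(min(d), max(d))
```

The minimum and maximum of the returned list correspond to the "min d_i" and "max d_i" columns of the table.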
More informationVoltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via SoftwareGuided Thread Scheduling
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via SoftwareGuided Thread Scheduling Vijay Janapa Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael D.
More informationSIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
More informationEfficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.
Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machineindependent measure of efficiency Growth rate Complexity
More informationCOLORIS: A Dynamic Cache Partitioning System Using Page Coloring
COLORIS: A Dynamic Cache Partitioning System Using Page Coloring Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li Computer Science Department, Boston University Boston, MA, USA yingy@cs.bu.edu, richwest@cs.bu.edu,
More informationNumerical Matrix Analysis
Numerical Matrix Analysis Lecture Notes #10 Conditioning and / Peter Blomgren, blomgren.peter@gmail.com Department of Mathematics and Statistics Dynamical Systems Group Computational Sciences Research
More informationChapter 5: CPU Scheduling. Operating System Concepts 7 th Edition, Jan 14, 2005
Chapter 5: CPU Scheduling Operating System Concepts 7 th Edition, Jan 14, 2005 Silberschatz, Galvin and Gagne 2005 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms MultipleProcessor Scheduling
More informationADRM: Architectureaware Distributed Resource Management of Virtualized Clusters
ADRM: Architectureaware Distributed Resource Management of Virtualized Clusters Hui Wang, Canturk Isci, Lavanya Subramanian, Jongmoo Choi, Depei Qian, Onur Mutlu Beihang University, IBM Thomas J. Watson
More informationAASH: An AsymmetryAware Scheduler for Hypervisors
AASH: An AsymmetryAware Scheduler for Hypervisors Vahid Kazempour Ali Kamali Alexandra Fedorova Simon Fraser University, Vancouver, Canada {vahid kazempour, ali kamali, fedorova}@sfu.ca Abstract Asymmetric
More informationx64 Servers: Do you want 64 or 32 bit apps with that server?
TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called
More informationThreads (Ch.4) ! Many software packages are multithreaded. ! A thread is sometimes called a lightweight process
Threads (Ch.4)! Many software packages are multithreaded l Web browser: one thread display images, another thread retrieves data from the network l Word processor: threads for displaying graphics, reading
More informationModeling Virtual Machine Performance: Challenges and Approaches
Modeling Virtual Machine Performance: Challenges and Approaches Omesh Tickoo Ravi Iyer Ramesh Illikkal Don Newell Intel Corporation Intel Corporation Intel Corporation Intel Corporation omesh.tickoo@intel.com
More informationHardware Configuration Guide
Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...
More informationA Study on the Scalability of Hybrid LSDYNA on Multicore Architectures
11 th International LSDYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LSDYNA on Multicore Architectures YihYih Lin HewlettPackard Company Abstract In this paper, the
More informationVirtualizing Performance Asymmetric Multicore Systems
Virtualizing Performance Asymmetric Multi Systems Youngjin Kwon, Changdae Kim, Seungryoul Maeng, and Jaehyuk Huh Computer Science Department, KAIST {yjkwon and cdkim}@calab.kaist.ac.kr, {maeng and jhhuh}@kaist.ac.kr
More informationCooperative Virtual Machine Scheduling on Multicore Multithreading Systems A Feasibility Study
Cooperative Virtual Machine Scheduling on Multicore Multithreading Systems A Feasibility Study Dulcardo Arteaga, Ming Zhao, Chen Liu, Pollawat Thanarungroj, Lichen Weng School of Computing and Information
More informationTable of Contents. June 2010
June 2010 From: StatSoft Analytics White Papers To: Internal release Re: Performance comparison of STATISTICA Version 9 on multicore 64bit machines with current 64bit releases of SAS (Version 9.2) and
More information! Solve problem to optimality. ! Solve problem in polytime. ! Solve arbitrary instances of the problem. !approximation algorithm.
Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NPhard problem What should I do? A Theory says you're unlikely to find a polytime algorithm Must sacrifice one of
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationSIMS 255 Foundations of Software Design. Complexity and NPcompleteness
SIMS 255 Foundations of Software Design Complexity and NPcompleteness Matt Welsh November 29, 2001 mdw@cs.berkeley.edu 1 Outline Complexity of algorithms Space and time complexity ``Big O'' notation Complexity
More informationEnergyEfficient Virtual Machine Scheduling in PerformanceAsymmetric MultiCore Architectures
EnergyEfficient Virtual Machine Scheduling in PerformanceAsymmetric MultiCore Architectures Yefu Wang 1, Xiaorui Wang 1,2, and Yuan Chen 3 1 University of Tennessee, Knoxville 2 The Ohio State University
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationA Hybrid Load Balancing Policy underlying Cloud Computing Environment
A Hybrid Load Balancing Policy underlying Cloud Computing Environment S.C. WANG, S.C. TSENG, S.S. WANG*, K.Q. YAN* Chaoyang University of Technology 168, Jifeng E. Rd., Wufeng District, Taichung 41349
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance Driven Gang Scheduling,
More informationNext Generation GPU Architecture Codenamed Fermi
Next Generation GPU Architecture Codenamed Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
More informationApplying Data Analysis to Big Data Benchmarks. Jazmine Olinger
Applying Data Analysis to Big Data Benchmarks Jazmine Olinger Abstract This paper describes finding accurate and fast ways to simulate Big Data benchmarks. Specifically, using the currently existing simulation
More informationHow Much Power Oversubscription is Safe and Allowed in Data Centers?
How Much Power Oversubscription is Safe and Allowed in Data Centers? Xing Fu, Xiaorui Wang University of Tennessee, Knoxville, TN 37996 The Ohio State University, Columbus, OH 43210 {xfu1, xwang}@eecs.utk.edu
More informationParallelism and Cloud Computing
Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication
More information