Linear-time Modeling of Program Working Set in Shared Cache


Xiaoya Xiang, Bin Bao, Chen Ding, Yaoqing Gao
Computer Science Department, University of Rochester
IBM Toronto Software Lab

Abstract

Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all $O(n^2)$ windows in an $n$-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

Keywords: Footprint, Cache sharing

I. INTRODUCTION

During a program execution, its working set can be defined as the footprint, which is the volume of data accessed in an execution window. Since the footprint shows the active data usage, it has been used to model resource sharing among concurrent tasks and to improve throughput and enforce fairness, either in memory sharing among multiprogrammed workloads or more recently in cache sharing among multicore workloads.

A trace of $n$ data accesses has $\binom{n}{2} = \frac{n(n-1)}{2}$ distinct windows and therefore $\frac{n(n-1)}{2}$ footprints. Early studies measured program footprints in the shared cache of time-sharing systems. Since applications interact between time quanta, it is sufficient to consider just the windows of a single length, the length of a scheduling quantum [20], [22]. On today's multicore systems, however, programs interact continuously.
A number of techniques were developed to estimate the footprint in all-length windows, but they did not guarantee the precision of the estimation [2], [3], [6], [8], [9], [2]. A recently published technique called all-window footprint analysis can measure all footprints in $O(CK \log M)$ time, where $CK$ is linear in the length of the trace and $M$ is the volume of data accessed in the trace [23]. For each window length, the analysis shows the maximum size, the minimum size, and the size distribution of footprints in all windows of this length [23]. The analysis is not fully accurate but guarantees a relative precision, e.g. 99%. We call the analysis all-footprint analysis, because it measures the size of every footprint.

In this paper, we present average-footprint analysis. For each window length, the analysis shows the average size of footprints in all windows of this length. While the analysis gives the accurate average, it does not measure the range or the distribution. However, a weaker analysis can often be done faster. Indeed, we show that the average footprint can be measured accurately in linear time $O(n)$ for a trace of length $n$, regardless of the data size.

The average footprint is a function mapping from the length of an execution window to the volume of its data access. Intuitively, the working set increases in larger execution windows. We prove that the average footprint is monotonically non-decreasing. The new analysis precisely quantifies the growth of the average footprint over time.

The previous all-footprint analysis was the key metric used in the composable models of cache sharing [3], [6], [2], [23]. For $P$ programs, there are $2^P$ co-run combinations. A composable model makes $2^P$ predictions using $P$ single-program runs rather than $2^P$ parallel runs. As an alternative to all-footprint analysis, the new average-footprint analysis can be used in the composable model to reduce the (footprint) measurement cost asymptotically.
To evaluate the speed and usefulness of the average-footprint analysis, we test it on the complete suites of SPEC 2000 and SPEC 2006 CPU benchmarks and compare the results with the fastest all-footprint analysis [23]. To measure the accuracy of cache sharing prediction, we rank the slowdowns in two- and three-program co-runs on a quad-core machine and compare the predicted ranking with exhaustive testing. Through experiments, we show that the average-footprint analysis can predict the effect of cache interference as accurately as the all-footprint analysis, yet at only a fraction of its cost. In fact, the cost of all-footprint analysis is too high for it to model SPEC 2006 benchmarks, which have up to 1.9 trillion accesses to up to 1GB data. In comparison, the average-footprint analysis can model all SPEC 2006 benchmarks, finishing most of the programs within a few hours.

This study has two limitations. First, we are concerned with parallel workloads consisting of only sequential programs that do not share data. We do not consider parallel programs, although similar footprint metrics have been studied to model multi-threaded workloads [6], [7]. Second, the footprint results are input specific, so they are useful mostly in workload characterization, for example, finding the most and the least interference among a set of benchmark programs.

II. BACKGROUND ON OFF-LINE CACHE MODELS

Off-line cache models do not improve performance directly but can be used to understand the causes of interference and to predict its effect before running the programs (so they may be grouped to reduce interference). Off-line analysis measures the effect of all data accesses, not just cache misses. It characterizes a single program unperturbed by other programs and the analysis itself. Such clean-room metrics avoid the chicken-egg problem when programs are analyzed together: the interference depends on the miss rate of co-running programs, but their miss rate in turn depends on the interference. Next we describe first the locality model of private cache and then the model for shared cache.

a) Reuse windows and the locality model: For each memory access, the temporal locality is determined by its reuse window, which includes all data accesses between this and the previous access to the same datum. Specifically, whether the access is a cache (capacity) miss depends on the reuse distance, the number of distinct data elements accessed in the reuse window. The relation between reuse distance and the miss rate has been well established [9], [6]. The capacity miss rate can be defined by a probability function involving the reuse distance and the cache size. Let the test program be A.

$$P(\text{capacity miss by A alone}) = P(\text{A's reuse distance} \ge \text{cache size})$$

b) Footprint windows and the cache sharing model: Off-line cache sharing models were pioneered by Chandra et al. [3] and Suh et al. [2] for a group of independent programs and extended for multi-threaded code by Ding and Chilimbi [6], Schuff et al. [7], and Jiang et al. [2]. Let A, B be two programs that share the same cache but do not share data. The effect of B on the locality of A is

$$P(\text{capacity miss by A when co-running with B}) = P((\text{A's reuse distance} + \text{B's footprint}) \ge \text{cache size})$$

Given an execution window in a sequential trace, the footprint is the number of distinct elements accessed in the window. The examples in Figure 1 illustrate the interaction between locality and footprint. In the first example, a reuse window in program A concurs with a time window in program B. The reuse distance of A is lengthened by the footprint of B's window. The second example uses two pairings of three traces to show that the shared-cache miss rate depends also on the footprint, not just the miss rate of co-run threads.

[Figure 1: Example illustrations of cache sharing. (a) In shared cache, the reuse distance in program A is lengthened by the footprint of program B. (b) Programs B1 and B2 have the same miss rate. However, A and B1 incur 50% more misses in shared cache than A and B2. The difference is caused not by data reuse but by data footprint.]

An implication of the cache sharing model is that cache interference is asymmetric for programs with different locality and footprints. A program with large footprints and short reuse distances may disproportionally slow down other programs while experiencing little or no slowdown itself. This was observed in experiments [23], [24]. In one program pair, the first program shows a near 85% slowdown while the other program shows only a 5% slowdown.

III. THE MEASUREMENT OF AVERAGE FOOTPRINT

A. Definitions

Let $W$ be the set of $\binom{n}{2}$ windows of a length-$n$ trace. Each window $w = \langle t, v \rangle$ has a length $t$ and a footprint $v$. Let $I(p)$ be a boolean function returning 1 when $p$ is true and 0 otherwise. The footprint function $fp(t)$ averages over all windows of length $t$:

$$fp(t) = \frac{\sum_{w_i \in W} v_i\, I(t_i = t)}{\sum_{w_i \in W} I(t_i = t)} = \frac{\sum_{w_i \in W} v_i\, I(t_i = t)}{n - t + 1}$$

For example, the trace abbb has 3 windows of length 2: ab, bb, and bb. The corresponding footprints are 2, 1, and 1, so $fp(2) = (2 + 1 + 1)/3 = 4/3$.

B. O(n) Algorithm

There is a linear-time algorithm that calculates the precise average footprint for all execution windows of a trace. Let $n$ and $m$ be the length of the trace and the number of distinct data used in the trace. The algorithm first measures the following three quantities:

- the distribution of the time distances of all data reuses ($n - m$ distances)
- the first-access times of all distinct data ($m$ access times)
- the last-access times of all distinct data (exact definition later, $m$ access times)

The three quantities can be measured by a single pass over the trace using a hash table with one entry for each distinct datum. The cost is linear, $O(n)$ in time and $O(m)$ in space.
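As a rough illustration of the single pass described above, the following Python sketch collects the three quantities and then evaluates the closed-form expression that the rest of this section derives as Equation 7. The function names (`profile`, `average_footprint`, `brute_force`) are hypothetical and not taken from the authors' tool, and the use of exact rational arithmetic is an assumption made only to keep the comparison with a brute-force count exact.

```python
from collections import Counter
from fractions import Fraction


def profile(trace):
    """Single pass over the trace (times are 1-based).

    Returns n, the first-access times f_i, the reverse last-access
    times l_i = n + 1 - (last access position), and the histogram r_t
    of reuse time distances.
    """
    n = len(trace)
    last_seen = {}          # datum -> time of its most recent access
    first = {}              # datum -> first-access time f_i
    reuse_hist = Counter()  # reuse time distance t -> count r_t
    for time, datum in enumerate(trace, start=1):
        if datum in last_seen:
            reuse_hist[time - last_seen[datum]] += 1
        else:
            first[datum] = time
        last_seen[datum] = time
    rev_last = {d: n + 1 - t for d, t in last_seen.items()}
    return n, first, rev_last, reuse_hist


def average_footprint(trace, w):
    """Evaluate Equation 7 for one window length w, 1 <= w <= n."""
    n, first, rev_last, reuse_hist = profile(trace)
    m = len(first)
    # Total count of (window, datum) pairs where the datum is absent.
    missing = sum(f - w for f in first.values() if f > w)
    missing += sum(l - w for l in rev_last.values() if l > w)
    missing += sum((t - w) * r for t, r in reuse_hist.items() if t > w)
    return m - Fraction(missing, n - w + 1)


def brute_force(trace, w):
    """Reference: average the footprint of every length-w window."""
    n = len(trace)
    total = sum(len(set(trace[s:s + w])) for s in range(n - w + 1))
    return Fraction(total, n - w + 1)
```

For the trace abbb from the definition above, `average_footprint("abbb", 2)` evaluates to 4/3, matching the hand computation. The sketch re-profiles the trace on every call for brevity; a real tool would profile once and reuse the three inputs for all window lengths.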

The three measures are the inputs to a formula $fp(w)$. For any window size $w$ ($0 < w \le n$), $fp(w)$ computes the average footprint for all windows of size $w$. In other words, the formula computes the average footprint for windows of all sizes without having to inspect the trace again. In the rest of the section, we derive the formula and discuss its complexity.

The main idea of the formula is differential counting, which counts the difference in the footprint between consecutive windows. For any window size $w$, we start with the footprint in the first window and then compute its increase or decrease as the window moves forward in the trace. The first-access times are sufficient to compute the footprint of the first window. The change in later windows depends on two metrics on each trace element $d_i$: the forward time distance $fwd(d_i)$ and the backward time distance $bwd(d_i)$. Let datum $x$ be accessed at $d_i$. Let the closest accesses of $x$ be $d_j$ before $d_i$ and $d_k$ after $d_i$. Then $fwd(d_i) = k - i$ and $bwd(d_i) = i - j$. The forward and backward time distances determine the change of footprint between consecutive windows. The relation is shown in Figure 2.

[Figure 2: An illustration of how the forward and backward (reuse) time distances influence the change in footprint between consecutive windows.]

Let the footprint of a $w$-size window starting at $i$ be $fp(i)$. Each element $d_i$ in the trace affects the footprint of $w$ windows: $fp(i-w+1), fp(i-w+2), \ldots, fp(i)$. In differential counting, we consider only the effect of $d_i$ on two pairs of windows: the change from $fp(i-w)$ to $fp(i-w+1)$ when $d_i$ enters into its first window and the change from $fp(i)$ to $fp(i+1)$ when $d_i$ exits from its last window of influence. Figure 2 shows $d_i$ and the two pairs of windows where $d_i$ enters between the first pair and exits between the second pair.

When $d_i$ enters, it does not increase the footprint $fp(i-w)$ if the same datum was previously accessed within $fp(i-w+1)$, which means that its backward time distance is no greater than $w$ ($bwd(d_i) \le w$). This is the case illustrated in Figure 2. Otherwise, $d_i$ adds 1 to the footprint $fp(i-w)$. Similarly, when $d_i$ exits from $fp(i)$, the departure does not change $fp(i+1)$ if $fwd(d_i) \le w$; otherwise, it subtracts 1 from $fp(i+1)$, as in the case illustrated in Figure 2. The footprint $fp(i+1)$ depends on three factors: the footprint $fp(i)$, the contribution of the entering $d_{i+w}$, and the detraction of the exiting $d_i$. The footprint of all windows is then computed by adding these differences.

Next we formulate this computation. We use the following notations.

- $n, m, w$: the length of the trace, the size of data, and the window size of interest
- $d_i$: the $i$-th trace access
- $fp(i)$: the footprint of the window from $d_i$ to $d_{i+w-1}$ (including $d_i$ and $d_{i+w-1}$)
- $bwd(d_i)$: the backward reuse time distance of $d_i$, $\infty$ if $d_i$ is the first access
- $fwd(d_i)$: the forward reuse time distance of $d_i$, $\infty$ if $d_i$ is the last access
- $I(p)$: a boolean function that returns 1 if $p$ is true and 0 otherwise

For example, $I(bwd(d_i) > w)$ gives the contribution by $d_i$, which is 1 if $bwd(d_i) > w$ and 0 otherwise. Similarly, $I(fwd(d_i) > w)$ gives the detraction of $d_i$, 1 if $fwd(d_i) > w$ and 0 otherwise.

The total size of the footprints in all windows of length $w$, when divided by the number of windows $n - w + 1$, is the average footprint, as shown in Equation 1.

$$fp(w) = \frac{1}{n-w+1} \sum_{i=1}^{n-w+1} fp(i) \qquad (1)$$

By differential counting, consecutive footprints are related by

$$fp(i+1) = fp(i) + I(bwd(d_{i+w}) > w) - I(fwd(d_i) > w) \qquad (2)$$

Expanding Equation 1 using Equation 2, we have three components in the average footprint:

$$fp(w) = fp(1) + \frac{1}{n-w+1} \left( \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) \right) \qquad (3)$$

Next we compute each component separately.
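Before the components are simplified, the differential-counting step of Equation 2 can be checked directly on a small trace. The sketch below is only an illustration under assumptions: it uses 0-based positions, represents the infinite distance of a first or last access with `math.inf`, and the function names are hypothetical.

```python
import math


def time_distances(trace):
    """Return (bwd, fwd) lists of reuse time distances per position.

    math.inf marks a first access (bwd) or a last access (fwd).
    """
    n = len(trace)
    bwd = [math.inf] * n
    fwd = [math.inf] * n
    last_seen = {}
    for i, datum in enumerate(trace):
        if datum in last_seen:
            j = last_seen[datum]
            bwd[i] = i - j   # distance back to the previous access
            fwd[j] = i - j   # same distance, seen from the earlier access
        last_seen[datum] = i
    return bwd, fwd


def sliding_footprints(trace, w):
    """Footprints of all length-w windows, computed via Equation 2."""
    n = len(trace)
    bwd, fwd = time_distances(trace)
    # First window: its footprint is the number of first accesses in it.
    fp = sum(1 for i in range(w) if bwd[i] == math.inf)
    footprints = [fp]
    for i in range(n - w):  # i is the 0-based start of the current window
        # d_{i+w} enters the next window, d_i leaves it.
        fp += int(bwd[i + w] > w) - int(fwd[i] > w)
        footprints.append(fp)
    return footprints
```

On the trace abbaadaccc with w = 3, this yields the footprints 2, 2, 2, 2, 2, 3, 2, 1, the same values obtained by counting distinct elements in each window.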
The footprint of the first window of length $w$ is

$$fp(1) = \sum_{i=1}^{w} I(bwd(d_i) = \infty) \qquad (4)$$

In the next component, we split the backward time distances into two groups: finite and infinite distances. The summation over the finite distances can start from 1 instead of from $w+1$, because a finite backward distance at a position $i \le w$ is less than $w$ and contributes nothing.

$$\sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) > w) = \sum_{i=w+1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty)$$

$$= \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) + \sum_{i=w+1}^{n} (n-i+1)\, I(bwd(d_i) = \infty) \qquad (5)$$

Similarly, we decompose and simplify the forward distances:

$$\sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) > w) = \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) + \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) \qquad (6)$$

Combining Equations 4, 5, and 6, we can now expand Equation 3. Instead of using individual accesses, we now use the three inputs, defined as follows:

- $f_i$: the first access time of the $i$-th datum
- $l_i$: the reverse last access time of the $i$-th datum. If the last access is at position $x$, $l_i = n + 1 - x$, that is, the first access time in the reverse trace.
- $r_t$: the number of accesses with a reuse time distance $t$

$$fp(w) = \sum_{i=1}^{n} I(bwd(d_i) = \infty) + \frac{1}{n-w+1} \Big( \sum_{i=w+1}^{n} (w-i)\, I(bwd(d_i) = \infty) - \sum_{i=1}^{n-w} (n-i+1-w)\, I(fwd(d_i) = \infty) + \sum_{i=1}^{n} (n-i+1)\, I(w < bwd(d_i) < \infty) - \sum_{i=1}^{n} (n-i+1-w)\, I(w < fwd(d_i) < \infty) \Big)$$

$$= m + \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (w - f_i)\, I(f_i > w) + \sum_{i=1}^{m} (w - l_i)\, I(l_i > w) + \sum_{t=1}^{n-1} (w - t)\, I(t > w)\, r_t \Big)$$

$$= m - \frac{1}{n-w+1} \Big( \sum_{i=1}^{m} (f_i - w)\, I(f_i > w) + \sum_{i=1}^{m} (l_i - w)\, I(l_i > w) + \sum_{t=w+1}^{n-1} (t - w)\, r_t \Big) \qquad (7)$$

The formula of Equation 7 passes the sanity check that the average footprint $fp(w)$ is at most the data size $m$, and the footprint of the whole trace ($w = n$) is $m$. Fixing the window length $w$ and ignoring the effect of first and last accesses, we see that the footprint decreases if more reuse time distances ($r_t$) have larger values ($t$). This suggests that improving locality reduces the average footprint. For example, if we double the length of a trace by repeating each element twice, the length of the long time distances would double, and the average footprint would drop.

For each window length $w$, Equation 7 can be computed in time $O(w)$. If we consider only window sizes on a logarithmic scale, the formula can be represented and evaluated in $O(\log w)$ time.

C. Monotonicity

Theorem 3.1: The average footprint $fp(w)$ is non-decreasing.

Proof: Let $w_i^k$ denote the $i$-th window whose size is $k$, and let $f(w_i^k)$ denote the footprint of that window. We prove that $\forall k$, $0 < k < n$, $fp(k+1) \ge fp(k)$.

First, $\forall i$, $0 < i \le n-k$, the following holds because $w_i^k$ and $w_{i+1}^k$ are both contained in $w_i^{k+1}$:

$$f(w_i^{k+1}) \ge f(w_i^k), \qquad f(w_i^{k+1}) \ge f(w_{i+1}^k)$$

In addition, $\forall k$, $0 < k \le n$, $\exists j$, $0 < j \le n-k+1$, such that $f(w_j^k) \le fp(k)$, since some window is no larger than the average. Then

$$fp(k+1) = \frac{1}{n-k} \sum_{i=1}^{n-k} f(w_i^{k+1})$$

$$= \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k+1}) + \sum_{i=j}^{n-k} f(w_i^{k+1}) \Big]$$

$$\ge \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j}^{n-k} f(w_{i+1}^{k}) \Big]$$

$$= \frac{1}{n-k} \Big[ \sum_{i=1}^{j-1} f(w_i^{k}) + \sum_{i=j+1}^{n-k+1} f(w_i^{k}) \Big]$$

$$= \frac{1}{n-k} \Big[ \sum_{i=1}^{n-k+1} f(w_i^{k}) - f(w_j^{k}) \Big]$$

$$= \frac{1}{n-k} \big[ (n-k+1)\, fp(k) - f(w_j^k) \big]$$

$$= fp(k) + \frac{1}{n-k} \big[ fp(k) - f(w_j^k) \big] \ge fp(k)$$

IV. AVERAGE FOOTPRINT IN THE COMPOSABLE MODEL

Our previous work used all-footprint analysis in the composable model to predict cache interference [23]. In the composable model, when multiple programs are run together, each reuse distance in a program is lengthened by the aggregate footprint of all peer programs over the same time window. Suppose there are $n$ programs $t_1, t_2, \ldots, t_n$ running on a shared cache. The miss rate is computed by

$$P(\text{capacity miss by } t_i \text{ running with } t_j,\ j = 1, \ldots, n,\ j \ne i) = P\Big(\big(t_i\text{'s reuse distance} + \sum_{j \ne i} t_j\text{'s footprint}\big) \ge \text{cache size}\Big)$$

Suppose the distribution of program $t_i$'s reuse distance is $D_{rd}(t_i)$, and the distribution of program $t_i$'s footprint of window size $w$ is $D_{fp}(t_i, w)$. The first distribution is defined as

$$D_{rd}(t_i) = \Big\{ \langle x_{k_i}, p_{k_i} \rangle \ \Big|\ \sum_{k_i} p_{k_i} = 1 \Big\}$$

where $\langle x_{k_i}, p_{k_i} \rangle$ means that the probability that the reuse distance equals $x_{k_i}$ is $p_{k_i}$. Similarly, we define

$$D_{fp}(t_i, w) = \Big\{ \langle y^w_{k_i}, q^w_{k_i} \rangle \ \Big|\ \sum_{k_i} q^w_{k_i} = 1 \Big\}$$

Given a window size $w$, we use $\langle y^w_{k_i}, q^w_{k_i} \rangle$ to mean that the probability that the footprint equals $y^w_{k_i}$ is $q^w_{k_i}$.

Consider a 2-program co-run involving $t_1$ and $t_2$. The capacity miss rate of $t_1$ is calculated by Equation 8.

$$mr(t_1) = \sum_{k_1} \sum_{k_2} p_{k_1}\, q^{w(x_{k_1})}_{k_2}\, I\big(x_{k_1} + y^{w(x_{k_1})}_{k_2} \ge C\big) \qquad (8)$$

where $I$ is the indicator function, $C$ is the cache size, and $w(x_{k_1})$ is the size of the reuse window that contains the reuse distance $x_{k_1}$. This is the equation employed by all-footprint based modeling [23].

To use average-footprint analysis instead, we define the average footprint of a window size $w$ for program $t_i$ as $F(t_i, w) = f_i^w$. Equation 8 can be simplified to Equation 9.

$$mr(t_1) = \sum_{k_1} p_{k_1}\, I\big(x_{k_1} + f_2^{w(x_{k_1})} \ge C\big) \qquad (9)$$

The estimation of the execution time from the miss rate is the same as in [23]. The only difference is that the previous model uses all-footprint analysis and Equation 8, and the new model uses average-footprint analysis and the simpler Equation 9.

V. EVALUATION

A. Experimental Setup

We have implemented the average-footprint analysis algorithm in a profiling tool and tested 26 SPEC2K benchmarks, 12 integer and 14 floating-point, and 29 SPEC2006 benchmarks, 12 integer and 17 floating-point. All benchmarks are instrumented by Pin [5] and profiled on a machine with an Intel Core i5-660 processor and 4GB physical memory. The machine is set up with Fedora 3 and GCC. The two-program co-run results for SPEC 2000 are collected on an Intel Core 2 Duo machine with two 2.0GHz cores sharing 2MB L2 cache and 2GB memory. In order to measure 3-program co-runs, we use an Intel quad-core machine, with four 2.27GHz cores sharing 8MB L3 cache and 8GB memory. Except in Section V-G when we examine the effect of input, we use the reference input in the test.
Some programs, especially SPEC 2006, have multiple reference inputs. We use the first one tested by the auto-runner. In performance comparisons, the base program run time is one without Pin instrumentation or any other analysis.

The length of SPEC 2000 traces ranges from 4 billion in gcc to 425 billion in mgrid. The amount of data ranges from 3 thousand 64-byte cache blocks (MB) in eon to 3.2 million cache blocks (256MB) in gcc. The SPEC 2006 traces on average are 10 times as long as SPEC 2000 traces and have 5 times as many cache blocks. The trace bwaves is the longest with 1.9 trillion data accesses and has the most data, 928MB. The individual statistics of the 55 programs are listed in Table II.

To evaluate cache-sharing predictions, we run two experiments:

- 2-program co-runs. We predict all 2-program co-runs and compare the predicted ranking with that of the previous work, using the 15 SPEC 2000 benchmarks used in the previous work [23].
- 3-program co-runs. We started with the 10 representative benchmarks in SPEC2006 as selected by Zhuravlev et al. [29]. Reuse-distance analysis was too slow to measure 2 programs. We evaluate the prediction for all program triples of the remaining 8 programs.

In both tests, we also compare with a simple prediction method based on miss rates (by ranking the total miss rate of the programs in the co-run group) [23].

B. Efficiency of Average-footprint Analysis

Table I summarizes the analysis cost for the two benchmark suites, and for each suite, the average for integer and for floating-point programs. It divides the 55 tests into four groups: 12 SPEC 2000 integer programs, 14 SPEC 2000 floating-point programs, 12 SPEC 2006 integer programs, and 17 SPEC 2006 floating-point programs. The result of each group is summarized in three rows and three columns. The columns show the trace length, the data size, and the slowdown ratio of the profiling time to the unmodified run time.
The rows show the minimum, maximum, and the average slowdown factors for all benchmarks of the group. The minimum slowdowns in four benchmark groups are all below 10. The maximum slowdowns are 40, 32, 4, and 74. The average slowdowns are between 2 in SPEC 2006 integer tests and 29 in SPEC 2000 integer tests. On average across all four groups, average-footprint analysis takes no more than 30 times the original execution time.

[Table I: The min, max, and average costs (slowdowns) of average-footprint analysis for 55 SPEC 2000 and SPEC 2006 benchmark programs. Columns: trace length, data size (64B lines), avg-fp slowdown (x). Rows: min, max, and mean for each of SPEC2000 INT, SPEC2000 FP, SPEC2006 INT, and SPEC2006 FP programs.]

The individual results of the 55 programs are shown in Table II. Compared to the summary table, the individual-result table has two additional columns, which show the unmodified execution time and the time of average-footprint analysis. The unmodified time measures the execution of the original program without any instrumentation or analysis. On average, an unmodified SPEC 2000 program takes less than 3 minutes, and an unmodified SPEC 2006 program takes close to 10 minutes. Average-footprint analysis takes 3 to 73 minutes for SPEC 2000 programs and 10 minutes (gcc) to 10 hours (calculix) for SPEC 2006 programs.

C. Comparison with All-footprint Analysis

All-footprint analysis can analyze SPEC 2000 programs but not SPEC 2006 programs. We compare average- and all-footprint analysis on SPEC 2000 programs in Table III. SPEC 2000 has 26 programs in total. The paper on all-footprint analysis reported results for 15 of the programs [23]. The table summarizes the cost of the two analyses in these 15 tests in the last two columns. The slowdowns by average-footprint analysis are between 8.8 and 40. The slowdowns by all-footprint analysis are between 248 and 360. The average slowdown is 40 for average-footprint analysis and about 500 for all-footprint analysis. In other words, on average for these 15 programs, average-footprint analysis is 38 times faster than all-footprint analysis.

All-footprint analysis takes too long for SPEC 2006 programs. For example, it takes average-footprint analysis 10 hours to profile calculix. Being 38 times slower, all-footprint analysis would take more than two weeks to measure the all-footprint distribution.

D. Two-Program Co-run Ranking

The prior work showed co-run ranking results for 15 SPEC 2000 programs based on all-footprint analysis and compared them with miss-rate based ranking and exhaustive testing [23]. We now show the ranking results using average-footprint analysis and compare them with the three previous ranking methods.

We show the prediction results in a 2-D plot. The x-axis is the rank of program co-run groups. In this test, the rank ranges from 1 (the least interfering pair) to 105 (the most interfering pair). The y-axis shows the interference, measured by the quadratic mean of the slowdowns of programs in the co-run group. The slowdown of a co-run program is the ratio of its co-run time to the time running alone on the same machine (cache). For two programs with slowdowns $s_1$ and $s_2$, we have $y = \sqrt{(s_1^2 + s_2^2)/2}$.

The three graphs in Figure 3 show the plots for the predictions based on miss rate, all-footprint analysis, and average-footprint analysis. In each plot, the accurate result from exhaustive testing is shown by a monotonically increasing red line as a reference.
The simple miss-rate based prediction does not show an increasing trend, suggesting no correlation between the prediction and the actual interference. The two footprint-based predictions show significant correlation. Programs predicted to have a high interference tend to actually have a high interference.

Average-footprint analysis ranks several program pairs better than all-footprint analysis. Consider the pair with the highest interference, (art, mcf), with a slowdown of 2. The pair is ranked 23 by miss rate, 99 by all-footprint analysis, and 105 by average-footprint analysis. The average-footprint rank is precisely correct. All-footprint ranking has a significant misprediction for the program pair (gcc, art). The two programs slow each other down by 1.6 times. The pair should be ranked 97 but is ranked 44 by all-footprint analysis, which is worse than the miss-rate rank 70. The rank by average-footprint analysis is the best of the three at 86.

E. Three-Program Co-run Ranking

Evaluating larger group co-runs is difficult because the number of tests increases exponentially with the size of the co-run group. To test all 3-program co-runs in SPEC 2006 benchmarks, we would have to run $\binom{29}{3} = 3654$ tests. Even if we ran all the tests, it would have been impossible to show the results clearly. Fortunately, Zhuravlev et al. have analyzed the benchmark set based on the cache miss rates and access rates and identified 10 representatives [29]. We had to narrow down further because the reuse-distance analysis could finish only for 8 out of the 10 representatives: 403.gcc, 416.gamess, 429.mcf, 444.namd, 445.gobmk, 450.soplex, 453.povray, and 470.lbm. There are 56 different 3-program groups from these 8 benchmarks.

We show the prediction results in Figure 4. The results for 3-program co-runs of SPEC 2006 programs are similar to those of 2-program co-runs of SPEC 2000 programs.
As before, the miss-rate based prediction does not show a detectable correlation while the average-footprint analysis shows a clear correlation with the actual interference. The maximal slowdown increases from 2.0 in 2-program co-runs to 3.3 in 3-program co-runs, confirming the expectation that the interference becomes worse as the cache is shared by more programs. Exhaustive testing is also increasingly infeasible. For both reasons, the composable model becomes more valuable, and so does the higher efficiency of the average-footprint analysis.

[Figure 3: Evaluation of 2-program co-run predictions for 15 SPEC2000 benchmark programs. The prediction quality of average-footprint analysis is similar to that of all-footprint analysis. Panels: miss rate, all footprint, and average footprint, each plotted against exhaustive testing. x-axis: ranked program pairs (from least interference to most interference); y-axis: slowdown.]

[Figure 4: Comparison of 3-program co-run predictions for 8 SPEC2006 benchmarks. All-footprint analysis cannot model these programs because of its high cost. Panels: miss rate and average footprint, each plotted against exhaustive testing. x-axis: ranked program triples (from least interference to most interference); y-axis: slowdown.]

F. Rank and Performance Closeness

To quantify the difference between the predicted ranking and the accurate ranking, we define two metrics: the rank closeness and the performance closeness. The rank closeness shows on average how the predicted rank of a co-run group differs from the actual rank. We number $n$ co-run groups by their accurate rank $i$. Let $pred(i)$ be the predicted rank for group $i$. The rank closeness is defined as

$$\text{rank closeness} = \frac{\sum_{i=1}^{n} |pred(i) - i|}{n}$$

The formula is the Manhattan distance between two vectors $\langle pred(1), pred(2), \ldots, pred(n) \rangle$ and $\langle 1, 2, \ldots, n \rangle$, divided by $n$. The worst possible ranking has a rank-closeness score of $n/2$ if $n$ is even or $(n-1)/2$ otherwise.

Next we quantify the error in terms of the mis-predicted slowdown. Let $f(i)$ be the slowdown of the co-run group with the accurate rank $i$, and $f(pred(i))$ be the slowdown of the co-run group with the predicted rank $i$. The difference $|f(pred(i)) - f(i)|$ gives the mis-prediction. The performance closeness is the average mis-prediction for all groups:

$$\text{performance closeness} = \frac{\sum_{i=1}^{n} |f(pred(i)) - f(i)|}{n}$$

The two metrics are shown in Table IV. On average for 2-program co-runs, the miss-rate rank errs by 35 positions, while the footprint-based ranks err by 4 and 5 positions. For 3-program co-runs, the miss-rate rank errs by 9 positions, while the average-footprint rank errs by 6. In terms of performance, the mis-prediction of the miss-rate based ranking is twice as large as that of the footprint-based ranking.

In search of a closeness metric, we also measured the Levenshtein distance.
For two permutations of a set of numbers, the Levenshtein distance measures the number of edits needed to convert one to the other. For the 2-program co-run test, the distance is 103 for miss rate, 97 for all-footprint, and 96 for average-footprint. For the 3-program co-run test, the distance is 54 for miss rate and 48 for average-footprint. Levenshtein distance is not a good metric here since it does not distinguish a ranking that shows no correlation from rankings that do.
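The two closeness metrics defined in this section are simple to compute once the ranks are known. The following Python sketch is one possible reading of the definitions; the function names and the list-based encoding (position $i-1$ holds the value for the group with accurate rank $i$) are assumptions made for illustration, and the numbers in the usage note are made up rather than measured.

```python
def rank_closeness(pred):
    """Mean absolute difference between predicted and accurate ranks.

    pred[i - 1] is the predicted rank of the co-run group whose
    accurate rank is i (ranks run from 1 to n).
    """
    n = len(pred)
    return sum(abs(p - i) for i, p in enumerate(pred, start=1)) / n


def performance_closeness(pred, slowdown):
    """Mean absolute slowdown error caused by the mis-ranking.

    slowdown[i - 1] is the measured slowdown f(i) of the group whose
    accurate rank is i, so slowdown[p - 1] is f(pred(i)).
    """
    n = len(pred)
    return sum(abs(slowdown[p - 1] - slowdown[i - 1])
               for i, p in enumerate(pred, start=1)) / n
```

As a hypothetical usage example, a fully reversed ranking of four groups, `rank_closeness([4, 3, 2, 1])`, gives 2.0, which is the $n/2$ worst case for even $n$ stated above. With made-up slowdowns `[1.0, 1.5, 2.0, 2.5]` and a prediction that swaps only the first two groups, `performance_closeness([2, 1, 3, 4], [1.0, 1.5, 2.0, 2.5])` gives 0.25.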

[Table IV: Comparison of different ranking strategies. For 2-program co-runs over 15 SPEC2000 benchmarks, the table lists the performance closeness and rank closeness of the miss-rate, all-footprint, and avg-footprint rankings. For 3-program co-runs over SPEC2006 benchmarks, it lists the same two metrics for the miss-rate and avg-footprint rankings.]

G. The Effect of Input on Average Footprint

The footprint of a program execution is affected by the program input just as the length of the execution is affected by the input. An important question for profiling-based techniques is how much the footprints in training runs may differ from those in test runs. In this section, we give a preliminary measure of this difference.

Given a set of $k$ executions of the same program, we quantify the variation between the $k$ footprint functions $f_i(w)$ as follows. First, we compute the average of the average footprints:

$$\bar{f}(w) = \frac{1}{k} \sum_{i=1}^{k} f_i(w)$$

Then we compute the Manhattan distance between the $i$-th execution and the average as:

$$d_i = \frac{1}{|W|} \sum_{j=1}^{|W|} \frac{|f_i(w_j) - \bar{f}(w_j)|}{\bar{f}(w_j)}$$

where $|W|$ is the number of different window lengths. A Manhattan distance of x% means that on average, the footprint function of input $i$ differs from the average by x% in each window size.

Table V shows SPEC2000 and SPEC2006 programs, the number of inputs (provided by the benchmark suite and tested in our experiments), the range of trace lengths and data sizes in these inputs, and the smallest and largest Manhattan distances as we just defined. The majority of programs, 20 out of 37, see no more than 30% difference between footprints of different inputs. The minimal difference is less than 20% in all but 5 programs. Note that the effect of the input may be predicted using model fitting based on the input characteristics [26]. This is outside the scope of this paper.

VI. RELATED WORK

Locality models: Locality in private cache can be modeled by reuse distance, which can be measured with a guaranteed precision in time O(n log 2 m), where $n$ is the length of the trace and $m$ is the size of data [26].
Reuse distance has found many uses in workload characterization and program optimization [26]. There are a number of recent developments. Chauhan and Shei gave a method for static analysis of locality in MATLAB code [4]. Unlike profiling, whose results are usually input specific, static analysis can identify and model the effect of program parameters. Most previous models targeted program analysis. Ibrahim and Strohmaier used synthetic probes to emulate the locality of an application for efficient machine characterization [10]. Zhou studied the random cache replacement policy and gave a one-pass deterministic trace-analysis algorithm to compute the average miss rate (instead of simulating many times and taking the average) [28]. Finally, Schuff et al. defined multicore reuse distance analysis and improved its efficiency through sampling and parallelization [17]. The sampling was based on a method developed earlier by Zhong and Chang [25]. These techniques are concerned only with reuse windows and cannot measure the footprint in all execution windows, which is the problem addressed in this paper.

Off-line cache sharing models: The average working-set size in single-length execution windows, such as a scheduling quantum, can be computed in linear time. It has been used in studying multi-programmed systems [7], [22]. In a parallel environment such as today's multicore processors, programs interact constantly. The interference in all-length windows has been considered for memory [21] and for cache [3]. Both used a recursive equation involving the working set and the miss rate: as a window of size w is extended to w + 1, the change in the working set depends on whether the new access is a miss. Suh et al. assumed linear function growth when window sizes were close [21]. Chandra et al. computed the recursive relation bottom up [3]. The same problem has been solved using statistical inference. Two techniques, by Berg and Hagersten (known as StatCache) [2] and by Shen et al. [19], were used to infer the cache miss rate from the distribution of reuse times. Berg and Hagersten assumed a constant miss rate over time and random cache replacement [2], and Shen et al. assumed a Bernoulli process and LRU cache replacement [18], [19]. The latter method was adapted to predict cache interference [12]. A precise prediction was shown to be useful in approximately solving the optimal co-scheduling problem [11]. However, none of these methods can bound the approximation error. Our earlier work gave the first precise methods for measuring the footprint [5], [23] and an iterative model for the circular effect of cache interference [23]. The linear-time algorithm in this paper computes the average rather than the full distribution and improves the measurement speed by nearly 40 times, yet maintains a similar accuracy in shared-cache locality prediction.

On-line models: The miss rate curve has been used for memory partitioning to ensure fairness or maximize throughput in a parallel workload [27]. Similarly, reuse distance has been used for cache partitioning among data objects [14]. Recently, Zhuravlev et al. reviewed four models based on the miss rate and the reuse distance [29]. As on-line models, these techniques did not consider the working set metrics because of the cost. For example, Zhuravlev et al. considered a less accurate model from Chandra et al. because, for efficiency, it did not require all-window footprints [3]. Zhuravlev et al. showed that cache sharing is one of the factors but not necessarily the major factor [29]. Still, an accurate and fast solution may help to quantify the contribution from cache sharing in the overall interference.

Analytical models and streaming analysis: Counting the number of distinct data items has been considered as a
streaming analysis problem. Space-efficient (less than O(m)) solutions exist to measure frequency moments $F_0$ (footprint), $F_1$ (total frequency), $F_2$, $F_\infty$ (most frequent item), and entropy [1], [8], [13]. Instead of counting the $F_0$ moment over the whole trace, we solve the problem of collecting the average $F_0$ for all execution windows and focus on reducing the time complexity from $O(n^2)$ to linear. Streaming solutions may be combined to further reduce the space requirement of our algorithms.

VII. SUMMARY

Complete characterization of the footprint requires measuring data access in all execution windows. In this paper, we have presented the average footprint as a metric of the all-window footprint. The footprint function maps from time to average footprint. We have shown that the average footprint function is monotone and can be used in the composable model to rank cache interference in shared cache without having to test any parallel executions. We have presented a linear-time algorithm for accurately measuring the average footprint. The linear-time algorithm uses differential counting based on the forward and backward reuse time distance. When tested on SPEC CPU 2000 benchmarks, the average-footprint analysis is on average 38 times faster than the previous, all-footprint analysis, yet it shows comparable accuracy in shared-cache locality prediction. The average-footprint analysis was efficient enough to measure the newer SPEC 2006 benchmarks, but the all-footprint analysis was not.

ACKNOWLEDGMENT

We would like to thank Tongxin Bai for providing histogram mapping libraries. The presentation has been improved by the suggestions from Xipeng Shen and the systems group at University of Rochester. Xiaoya Xiang and Bin Bao are supported by two IBM Center for Advanced Studies Fellowships. The research is also supported by the National Science Foundation (Contract No. CCF-604, CCF , CNS ).

REFERENCES

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computing, pages 20 29, 1996.
[2] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 69 80,
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages ,
[4] A. Chauhan and C.-Y. Shei. Static reuse distances for locality-based optimizations in MATLAB. In International Conference on Supercomputing, pages , 200.
[5] C. Ding and T. Chilimbi. All-window profiling of concurrent executions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, poster paper.
[6] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi-threaded programs. Technical Report MSR-TR , Microsoft Research, August
[7] B. Falsafi and D. A. Wood. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 7():04 30, 1997.
[8] P. Flajolet and G. Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science, 1983.
[9] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(2):62 630, 1989.
[10] K. Z. Ibrahim and E. Strohmaier. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. Proceedings of the International Conference on Parallel Processing, 0: , 200.
[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[12] Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction, pages , 200.
[13] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 45 56,
[14] Q. Lu, J. Lin, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Soft-OLP: Improving hardware cache performance through software-controlled object-level partitioning. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages ,
[15] C.-K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, Illinois, June
[16] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM System Journal, 9(2):78 7, 1970.
[17] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 53 64, 200.
[18] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages ,
[19] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55 6,
[20] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 4(9): , 1992.
[21] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In International Conference on Supercomputing, pages 2, 200.
[22] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4): , 1987.
[23] X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi. All-window profiling and composable models of cache sharing. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 9 02, 20.
[24] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multi-core cache management. In Proceedings of the EuroSys Conference,
[25] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the International Symposium on Memory Management, pages 9 00,
[26] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 3(6): 39, Aug
[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 77 88,
[28] S. Zhou. An efficient simulation algorithm for cache of random replacement policy. In Proceedings of the IFIP International Conference on Network and Parallel Computing, pages 44 54, 200. Springer Lecture Notes in Computer Science No
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 29 42,

TABLE II: INDIVIDUAL STATISTICS OF THE 55 SPEC2000 AND SPEC2006 TEST PROGRAMS
Columns: benchmark, trace length, data size (64B lines), unmodified time (sec), avg-fp analysis time, FP alg cost (x).
SPEC2000 rows: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip, twolf, wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi.
SPEC2006 rows: perlbench, bzip, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk, bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx.
[table values missing]
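For readers who want a concrete reading of the quantity whose measurement cost is reported in Table II, the following is a brute-force sketch of the average footprint for one window length: the mean number of distinct data items over all windows of that length. It follows the definition only and is not the paper's linear-time algorithm; the function name `average_footprint` and the toy trace are assumptions made for illustration.

```python
def average_footprint(trace, w):
    """Mean number of distinct items over all length-w windows of trace.

    Definitional brute force: O(n * w) work for a single window length,
    so it is only practical for tiny traces.
    """
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    num_windows = n - w + 1
    total = sum(len(set(trace[i:i + w])) for i in range(num_windows))
    return total / num_windows


# Toy trace of five accesses to two data items.
trace = list("aabab")
for w in range(1, len(trace) + 1):
    print(w, average_footprint(trace, w))
```

On this toy trace the values never decrease as the window grows, which matches the monotonicity property stated in the summary.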

TABLE III: COMPARISON OF THE MIN, MAX, AND AVERAGE SLOWDOWNS BY AVERAGE-FOOTPRINT ANALYSIS AND BY ALL-FOOTPRINT ANALYSIS. ON AVERAGE, AVERAGE-FOOTPRINT ANALYSIS IS 38 TIMES FASTER.
Columns (5 SPEC2000 benchmarks): trace length, data size (64B lines), avg-fp slowdown (x), all-fp slowdown (x).
Rows: min, max, mean.
[table values missing]

TABLE V: SIMILARITY OF THE FOOTPRINT IN DIFFERENT EXECUTIONS OF THE 37 SPEC2K/2006 BENCHMARKS AS MEASURED BY THE MAX AND MIN MANHATTAN DISTANCE (max $d_i$, min $d_i$)
Columns: benchmark, inputs, min n ($10^9$), max n ($10^9$), min m, max m, min $d_i$, max $d_i$.
Rows: crafty, ammp, gap, gzip, mesa, equake, gcc, art, parser, bzip, vpr, mcf, twolf, namd, gromacs, lbm, GemsFDTD, povray, calculix, sphinx, h264ref, cactusADM, zeusmp, bwaves, mcf, sjeng, bzip, perlbench, leslie3d, gcc, milc, soplex, hmmer, libquantum, tonto, astar, gamess, omnetpp, gobmk.
[table values missing]
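The distances summarized in Table V can be sketched as follows. This is a minimal illustration under one stated assumption: that $d_i$ is the mean relative deviation of execution i's footprint function from the cross-input average, which is what the "differs by x% in each window size" reading suggests. The function name `input_variation` and the sample values are hypothetical.

```python
def input_variation(footprints):
    """Per-execution distance from the average footprint function.

    footprints: list of k lists; footprints[i][j] is the average footprint
    of execution i at the j-th sampled window length (same W lengths for all).
    Returns a list of k distances, each a fraction (0.30 means 30%).
    """
    k = len(footprints)
    num_lengths = len(footprints[0])
    # Average of the average footprints at each window length.
    mean = [sum(f[j] for f in footprints) / k for j in range(num_lengths)]
    # Assumed form: mean relative deviation over the W window lengths.
    return [
        sum(abs(f[j] - mean[j]) / mean[j] for j in range(num_lengths)) / num_lengths
        for f in footprints
    ]


# Three hypothetical inputs sampled at two window lengths.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
distances = input_variation(samples)
print(min(distances), max(distances))
```

The minimum and maximum of the returned list correspond to the min $d_i$ and max $d_i$ columns of the table.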

All-Window Profiling and Composable Models of Cache Sharing

All-Window Profiling and Composable Models of Cache Sharing All-Window Profiling and Composable Models of Cache Sharing Xiaoya Xiang Bin Bao Tongxin Bai Chen Ding Computer Science Department University of Rochester Rochester, NY 14627 {xiang,bao,bai,cding}@cs.rochester.edu

More information

Cache Conscious Task Regrouping on Multicore Processors

Cache Conscious Task Regrouping on Multicore Processors Cache Conscious Task Regrouping on Multicore Processors Xiaoya Xiang, Bin Bao, Chen Ding and Kai Shen Department of Computer Science, University of Rochester Rochester, NY, USA {xiang, bao, cding, kshen}@cs.rochester.edu

More information

An OS-oriented performance monitoring tool for multicore systems

An OS-oriented performance monitoring tool for multicore systems An OS-oriented performance monitoring tool for multicore systems J.C. Sáez, J. Casas, A. Serrano, R. Rodríguez-Rodríguez, F. Castro, D. Chaver, M. Prieto-Matias Department of Computer Architecture Complutense

More information

Achieving QoS in Server Virtualization

Achieving QoS in Server Virtualization Achieving QoS in Server Virtualization Intel Platform Shared Resource Monitoring/Control in Xen Chao Peng (chao.p.peng@intel.com) 1 Increasing QoS demand in Server Virtualization Data center & Cloud infrastructure

More information

How Much Power Oversubscription is Safe and Allowed in Data Centers?

How Much Power Oversubscription is Safe and Allowed in Data Centers? How Much Power Oversubscription is Safe and Allowed in Data Centers? Xing Fu 1,2, Xiaorui Wang 1,2, Charles Lefurgy 3 1 EECS @ University of Tennessee, Knoxville 2 ECE @ The Ohio State University 3 IBM

More information

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 64 Architecture Dong Ye David Kaeli Northeastern University Joydeep Ray Christophe Harle AMD Inc. IISWC 2006 1 Outline Motivation

More information

systems for program behavior analysis. 2 Programming systems is an expanded term for compilers, referring to both static and dynamic

systems for program behavior analysis. 2 Programming systems is an expanded term for compilers, referring to both static and dynamic Combining Locality Analysis with Online Proactive Job Co-Scheduling in Chip Multiprocessors Yunlian Jiang Kai Tian Xipeng Shen Computer Science Department The College of William and Mary, Williamsburg,

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

Schedulability Analysis for Memory Bandwidth Regulated Multicore Real-Time Systems

Schedulability Analysis for Memory Bandwidth Regulated Multicore Real-Time Systems Schedulability for Memory Bandwidth Regulated Multicore Real-Time Systems Gang Yao, Heechul Yun, Zheng Pei Wu, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha University of Illinois at Urbana-Champaign, USA.

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 6 Fundamentals in Performance Evaluation Computer Architecture Part 6 page 1 of 22 Prof. Dr. Uwe Brinkschulte,

More information

Virtualisierung im HPC-Kontext - Werkstattbericht

Virtualisierung im HPC-Kontext - Werkstattbericht Center for Information Services and High Performance Computing (ZIH) Virtualisierung im HPC-Kontext - Werkstattbericht ZKI Tagung AK Supercomputing 19. Oktober 2015 Ulf Markwardt +49 351-463 33640 ulf.markwardt@tu-dresden.de

More information

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: Clock cycle where: Clock rate = 1 / clock cycle f = 1 /C

More information

Towards Architecture Independent Metrics for Multicore Performance Analysis

Towards Architecture Independent Metrics for Multicore Performance Analysis Towards Architecture Independent Metrics for Multicore Performance Analysis Milind Kulkarni, Vijay Pai, and Derek Schuff School of Electrical and Computer Engineering Purdue University {milind, vpai, dschuff}@purdue.edu

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Department of Electrical Engineering Princeton University {carolewu, mrm}@princeton.edu Abstract

More information

Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites

Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites Characterizing the Unique and Diverse Behaviors in Existing and Emerging General-Purpose and Domain-Specific Benchmark Suites Kenneth Hoste Lieven Eeckhout ELIS Department, Ghent University Sint-Pietersnieuwstraat

More information

TPCalc : a throughput calculator for computer architecture studies

TPCalc : a throughput calculator for computer architecture studies TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University pierre.michaud@inria.fr Stijn.Eyerman@elis.UGent.be

More information

Scheduling on Heterogeneous Multicore Processors Using Architectural Signatures

Scheduling on Heterogeneous Multicore Processors Using Architectural Signatures Scheduling on Heterogeneous Multicore Processors Using Architectural Signatures Daniel Shelepov School of Computing Science Simon Fraser University Vancouver, Canada dsa5@cs.sfu.ca Alexandra Fedorova School

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

Running Virtual Machines in a Slurm Batch System

Running Virtual Machines in a Slurm Batch System Center for Information Services and High Performance Computing (ZIH) Running Virtual Machines in a Slurm Batch System Slurm User Group Meeting September 2015 Ulf Markwardt +49 351-463 33640 ulf.markwardt@tu-dresden.de

More information

Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources

Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources JeongseobAhn,Changdae Kim, JaeungHan,Young-ri Choi,and JaehyukHuh KAIST UNIST {jeongseob, cdkim, juhan, and jhuh}@calab.kaist.ac.kr

More information

FACT: a Framework for Adaptive Contention-aware Thread migrations

FACT: a Framework for Adaptive Contention-aware Thread migrations FACT: a Framework for Adaptive Contention-aware Thread migrations Kishore Kumar Pusukuri Department of Computer Science and Engineering University of California, Riverside, CA 92507. kishore@cs.ucr.edu

More information

Addressing Shared Resource Contention in Multicore Processors via Scheduling

Addressing Shared Resource Contention in Multicore Processors via Scheduling Addressing Shared Resource Contention in Multicore Processors via Scheduling Sergey Zhuravlev Sergey Blagodurov Alexandra Fedorova School of Computing Science, Simon Fraser University, Vancouver, Canada

More information

Preventing Denial-of-Service Attacks in Shared CMP Caches

Preventing Denial-of-Service Attacks in Shared CMP Caches Preventing Denial-of-Service Attacks in Shared CMP Caches Georgios Keramidas, Pavlos Petoumenos, Stefanos Kaxiras, Alexandros Antonopoulos, and Dimitrios Serpanos Department of Electrical and Computer

More information

Compiler-Assisted Binary Parsing

Compiler-Assisted Binary Parsing Compiler-Assisted Binary Parsing Tugrul Ince tugrul@cs.umd.edu PD Week 2012 26 27 March 2012 Parsing Binary Files Binary analysis is common for o Performance modeling o Computer security o Maintenance

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

On the Importance of Thread Placement on Multicore Architectures

On the Importance of Thread Placement on Multicore Architectures On the Importance of Thread Placement on Multicore Architectures HPCLatAm 2011 Keynote Cordoba, Argentina August 31, 2011 Tobias Klug Motivation: Many possibilities can lead to non-deterministic runtimes...

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Reducing Dynamic Compilation Latency

Reducing Dynamic Compilation Latency LLVM 12 - European Conference, London Reducing Dynamic Compilation Latency Igor Böhm Processor Automated Synthesis by iterative Analysis The University of Edinburgh LLVM 12 - European Conference, London

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

The Effect of Input Data on Program Vulnerability

The Effect of Input Data on Program Vulnerability The Effect of Input Data on Program Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu I. INTRODUCTION

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Optimized Resource Allocation in Cloud Environment Based on a Broker Cloud Service Provider

Optimized Resource Allocation in Cloud Environment Based on a Broker Cloud Service Provider International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013 1 Optimized Resource Allocation in Cloud Environment Based on a Broker Cloud Service Provider Jyothi.R.L *, Anilkumar.A

More information

Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors

Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors Journal of Instruction-Level Parallelism 7 (25) 1-28 Submitted 2/25; published 6/25 Understanding the Impact of Inter-Thread Cache Interference on ILP in Modern SMT Processors Joshua Kihm Alex Settle Andrew

More information

Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed

Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed 1 Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed Andreas Sandberg Erik Hagersten David Black-Schaffer Abstract Popular microarchitecture simulators are typically several orders

More information

A Methodology for Developing Simple and Robust Power Models Using Performance Monitoring Events

A Methodology for Developing Simple and Robust Power Models Using Performance Monitoring Events A Methodology for Developing Simple and Robust Power Models Using Performance Monitoring Events Kishore Kumar Pusukuri UC Riverside kishore@cs.ucr.edu David Vengerov Sun Microsystems Laboratories david.vengerov@sun.com

More information

A Predictive Model for Dynamic Microarchitectural Adaptivity Control

A Predictive Model for Dynamic Microarchitectural Adaptivity Control A Predictive Model for Dynamic Microarchitectural Adaptivity Control Christophe Dubach, Timothy M. Jones Members of HiPEAC University of Edinburgh Edwin V. Bonilla NICTA & Australian National University

More information

Code Coverage Testing Using Hardware Performance Monitoring Support

Code Coverage Testing Using Hardware Performance Monitoring Support Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye Matthew Iyer Vijay Janapa Reddi Daniel A. Connors Department of Electrical and Computer Engineering University of Colorado

More information

MCCCSim A Highly Configurable Multi Core Cache Contention Simulator

MCCCSim A Highly Configurable Multi Core Cache Contention Simulator MCCCSim A Highly Configurable Multi Core Cache Contention Simulator Michael Zwick, Marko Durkovic, Florian Obermeier, Walter Bamberger and Klaus Diepold Lehrstuhl für Datenverarbeitung Technische Universität

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Multiprogramming Performance of the Pentium 4 with Hyper-Threading

Multiprogramming Performance of the Pentium 4 with Hyper-Threading In the Third Annual Workshop on Duplicating, Deconstructing and Debunking (WDDD2004) held at ISCA 04. pp 53 62 Multiprogramming Performance of the Pentium 4 with Hyper-Threading James R. Bulpin and Ian

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

LCMON Network Traffic Analysis

LCMON Network Traffic Analysis LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Fast Multipole Method for particle interactions: an open source parallel library component

Fast Multipole Method for particle interactions: an open source parallel library component Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,

More information

The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

More information

Introduction to Cloud Computing
