Preventing Denial-of-Service Attacks in Shared CMP Caches


Georgios Keramidas, Pavlos Petoumenos, Stefanos Kaxiras, Alexandros Antonopoulos, and Dimitrios Serpanos
Department of Electrical and Computer Engineering, University of Patras, Patras, Greece

Abstract. Denial-of-Service (DoS) attacks try to exhaust some shared resource (e.g. process tables, functional units) of a service-centric provider. As Chip Multi-Processors (CMPs) become the mainstream architecture for server class processors, the need to manage on-chip resources in a way that can provide QoS guarantees becomes a necessity. Shared resources in CMPs typically include L2 cache memory. In this paper, we explore the problem of managing the on-chip shared caches in a CMP workstation where malicious or simply cache-hungry threads try to hog the cache, giving rise to DoS opportunities. An important characteristic of our method is that there is no need to distinguish between malicious and healthy threads. The proposed methodology is based on a statistical model of a shared cache that can be fed with run-time information and accurately describe the behavior of the sharing threads. Using this information, we are able to understand which thread (malicious or not) can be compressed into less space with negligible damage, and to drive the underlying replacement policy of the cache accordingly. Our results show that the proposed attack-resistant replacement algorithm can be used to enforce high-level policies, such as policies that try to maximize the usefulness of the cache real estate or assign custom space-allocation policies based on external QoS needs.

1 Introduction

In application domains that range from information access to electronic commerce, many services are susceptible to attacks by malicious clients that can significantly degrade their performance. One kind of attack, called a Denial-of-Service (DoS) attack, is a malicious attempt by a single person or a group of people to cripple an online service. This can have serious consequences for companies such as Amazon and eBay, which rely on their online availability to do business. In the past, many companies have fallen victim to DoS attacks, resulting in damages of millions of dollars [14][15]. Moreover, service providers may be forced by customer requirements to provide specific QoS guarantees. In this case, the providers must assure the quality of their services by assigning them a specific amount of resources (e.g. CPU cycles).

On the architecture front, processor designers are fast moving towards multiple cores on a chip to achieve new levels of performance. The target is to hide the long memory latencies as much as possible.

CMPs are becoming the dominant architecture for many server class machines [8][9][10]. For reasons of efficiency and economy of processor area, the sharing of some chip resources is a necessity. The shared resources in CMPs typically include the lower level caches. These shared resources create a need for fair and efficient management policies. A trivial solution would be to statically partition the shared resources among the running threads. However, this design point is inefficient in resource utilization when the demand is not uniform. From another point of view, having caches shared between threads provides a vastly more dangerous avenue of attack: a DoS attack [16]. A malicious application can abuse the shared cache, rendering the whole system practically inoperative, since the L2 is a critical element in the performance of all modern computers. Furthermore, according to [18], even though a DoS attack is usually intentional and malicious, such attacks can sometimes happen accidentally. For example, one person running a memory or CPU intensive program on a multiuser machine can cause all the other users of the system to experience an extreme slowdown, even if the running program is not by nature malicious. Furthermore, poor programming, either in choice of algorithm or in implementation, can also cause programs to consume resources disproportionately. This relates to the problem of attack detection: sometimes it is impossible to distinguish memory or CPU intensive applications from DoS attacks, since they operate identically. Hence, a desirable characteristic of any method against DoS is to manage the system threads in a fair and/or efficient manner without the need to distinguish between malicious and normal threads.

To model and understand cache sharing we have built a new theoretical framework that accurately describes applications' interplay in shared caches. Our cache model, named StatShare, is derived from the StatCache statistical cache model [6], which yields the miss ratio of an application for any cache size. While the StatCache model uses the number of memory references as its unit of time, StatShare uses the number of cache replacements at the studied cache level [4] as its unit of time. This allows for a natural mapping of the cache statistics to the shared cache level, and leads to a very efficient implementation of StatShare which enables on-line analysis feeding a dynamic resource scheduler. StatShare can predict the miss rate with great accuracy as a function of the active cache ratio used by an application. We also demonstrate how online StatShare results can be used as inputs to a resource scheduler.

We model and evaluate a cache resource sharing strategy based on Cache Decay, originally proposed for leakage reduction [7]. Our proposal introduces important differences. First, a decayed cacheline is simply available for replacement rather than turned off for leakage; thus, hits on decayed lines are allowed. Second, the decay interval is measured not in cycles but in CAT time. Our modified attack-resistant cache replacement algorithm has the added advantage that it does not need to classify a thread (client) as malicious or not malicious permanently, but instead computes this based on recent behavior. Hence, our algorithm performs a kind of dynamic check on each thread's behavior.
This is an important feature, since it is possible that a normal thread may be misclassified as malicious, though this classification will change with time. As an example, a thread that has poor locality may have a low hit rate (and try to hog the cache), resulting in its being identified as malicious by our approach and its eventual compression into less space. However, this does not significantly impact its performance, because the thread is already experiencing a low hit rate and hence higher latencies.

Structure of this paper. Section 2 surveys related work and reviews the StatCache model. Section 3 presents our StatShare model. Section 4 describes how cache decay can be integrated into the StatShare model to provide attack-resistant high-level cache management policies. Section 5 presents practical implementations and Section 6 our results. Section 7 summarizes the paper.

2 Related Work

Cache Partitioning Schemes. The issue of cache fairness was initially investigated by Kim et al. [2]. They introduce a set of metrics for fair cache sharing, and they implemented a static partitioning algorithm for the OS scheduler and a dynamic three-part algorithm (initialization, rollback and re-partitioning) for shared-cache partitioning. Their algorithms are based on stack-distance counters but do not restrict the cache replacement algorithm to LRU. Their partitioning algorithm is based on counters and partitioning registers. When a process is under-represented in the cache, it starts to pick its victims from other processes, while when it is over-represented, it picks its victims among its own lines. In [3], Kim et al. extend their previous work with three performance models that predict the impact of cache sharing on co-scheduled threads. The input to the models is the isolated second-level cache stack distance of the applications, and the output is the number of extra second-level cache misses for each thread due to cache sharing. Suh et al. [1] studied partitioning the cache among sharers by modifying the LRU replacement policy. The mechanism used in their scheme is the same as the one used by Kim et al. [2], but their focus is on performance and not fairness.

Denial-of-Service at the Architectural Level. One of the initial attempts to prevent DoS attacks at the architectural level was introduced by Soderquist and Leeser [19]. The authors proposed the idea of cache locking, where locked cachelines are not allowed to be removed from the cache, guaranteeing freedom from DoS attacks. In their approach, a dynamic cache locking technique, aided by custom processor instructions, treats locked cache lines as additional registers. Recently, many researchers have studied the issue of DoS attacks in the context of SMT processors. Because multiple threads share many resources (pipeline, execution units, etc.) in an SMT, there are many opportunities for a malicious thread to launch a DoS attack by abusing shared resources. Grunwald and Ghiasi describe a form of attack in which a malicious process repeatedly flushes the trace cache of an SMT by executing self-modifying code. Because the trace cache is shared among all the processes, the flushing degrades the performance of all threads [16]. Hasan et al. study DoS attacks based on power density [17]. The above techniques try to address DoS attacks by stalling the application that is suspected of malicious behavior. This may be a working solution for SMTs, but it is less attractive for CMPs, because CMPs have most of their resources unshared.

A stalled core in a CMP environment will lead to underutilization of the whole system. Furthermore, a service-targeted system may become unable to provide services even if no malicious threads are running on it [3][18]. In this scenario, the previous techniques will not detect a DoS attack, rendering the whole system practically inoperative. The problem becomes more serious when specific services require QoS guarantees.

The StatCache Model. StatCache is a technique for estimating an application's miss rate as a function of cache size, based on a very sparse and easily captured fingerprint of certain performance properties [6]. The application property measured is the reuse distance of the application's memory accesses, i.e., the number of memory references between two consecutive accesses to the same cacheline. Unlike stack distance, which measures the number of unique memory references between two consecutive accesses to the same cacheline, the reuse distance can easily be captured using functionality supported in today's hardware and operating systems.

[Figure 1: StatCache miss-ratio curves for selected SPEC2000 benchmarks, for cache sizes ranging from 1KB to 4MB.]

The reuse distances of all of an application's memory accesses are most easily represented as a histogram h(i), where h(0) is the number of references to the same cache line with no other intervening memory references, h(1) is the number of accesses with one intervening access, and so forth. The shape of this histogram is the performance fingerprint of an application. The shape can cheaply be approximated by randomly sampling every Nth access and measuring its reuse distance. Experiments have shown that a very sparse sampling rate is sufficient for long-running applications [6]. StatCache uses an application's histogram together with a simple statistical model of a cache and a simple numerical solver to derive the miss rate of the application as a function of cache size. Figure 1 shows StatCache results for a number of SPEC2000 benchmarks for various cache sizes. This figure provides our motivation for managing the cache and preventing cache-greedy applications (whether by nature malicious or not) from hogging it. As is evident from Figure 1, many programs have flat areas in their miss-rate curves, where a change in their cache size results in virtually no change in their miss rate. Such areas can be exploited to release cache space for other programs that can benefit from more cache space (as suggested by their miss-rate curves).
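To make the fingerprinting concrete, the following is a minimal sketch of how reuse-distance sampling could be done in software over a trace of cache-line addresses. It is illustrative only: the trace format, the function name, and the fixed sample rate are our assumptions, not the StatCache implementation (which uses hardware/OS watchpoint support instead of a software trace).

```python
import random
from collections import defaultdict

def sampled_reuse_histogram(trace, sample_rate=1e-4):
    """Approximate the reuse-distance histogram h(i) from a trace of
    cache-line addresses by sampling a random subset of the accesses.
    Sampled lines that are never touched again ("dangling" samples)
    correspond to cold misses."""
    histogram = defaultdict(int)
    watched = {}  # sampled line -> index of the sampled access
    for t, line in enumerate(trace):
        if line in watched:
            # reuse distance = number of intervening accesses
            histogram[t - watched.pop(line) - 1] += 1
        if random.random() < sample_rate:
            watched[line] = t
    return histogram, len(watched)  # histogram + cold-miss estimate
```

The resulting histogram is a scaled version of the full one with the same statistical shape, which is all the model needs.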

3 StatShare: A Statistical Cache Model in CAT Time

In this section, we describe the basic principles of our statistical model. A necessary compromise to construct our model is to assume a fully-associative cache with random replacement.

[Figure 2: CAT reuse-distance histograms for art and equake sharing a 256KB cache (L = 4096 lines; both axes in log scale).]

CAT Time. The reuse distance of a cacheline is measured as the number of intervening events (a notion of time) between two consecutive accesses to this cacheline. In [6], reuse distances are measured as the number of other intervening accesses. In contrast, we measure reuse distances with a different notion of time: our time is measured in Cache Allocation Ticks (CAT) [4], or in other words, cache replacements. The CAT clock can be advanced in two different ways: by snooping the cache replacements irrespective of the thread that causes the replacement in the shared cache (we call this a global CAT clock), or by counting only the replacements of lines belonging to a particular thread (we call these local or per-thread CAT clocks). Our theory is independent of which clock, global or local, we use for a thread's histogram, as long as we always relate the global clock to the size of the cache and the local clock to the thread's footprint in the cache. Both ways have their own positives and negatives, but we omit such analysis due to lack of space. For the didactic purposes of this section, we will assume global CAT as our notion of time. The importance of CAT time stems from the fact that it allows for a natural mapping of the cache statistics to the studied cache level.

CAT Reuse-Distance Histograms. The reuse-distance histogram of a program measured in CAT time is denoted as h(i), i = 0, 1, .... Figure 2 shows the histograms for two SPEC2000 programs, art and equake, sharing a 256KB cache. The histograms are collected in a time window of 2M instructions, and in this case we see reuse distances of up to a few tens of thousands of CAT. As we can see from the histograms, art shows a bimodal distribution of reuse distances, with the bulk of the samples at short reuse distances, but also with a significant bulge beyond L (L = 4096, the size of the cache in cachelines). This bulge signifies that many of the items that art accesses do not fit in the cache and produce a significant number of misses. It is responsible for the behavior of art, which hogs the cache and squeezes its companion thread into a very small footprint. In contrast, equake shows a distribution of reuse distances that decreases slowly to the right. The meaning of this distribution, as we will show, is that equake is already in a compressed state (we cannot squeeze it further without serious damage to its miss ratio) but it can benefit from expansion to a larger footprint. In general, many programs behave either like art or like equake; art-like programs are prime candidates for management-compression.
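To illustrate the difference from access-based time, the sketch below collects a reuse-distance histogram in CAT time using a global clock. The event hooks and class name are our own hypothetical interface, not the paper's hardware mechanism.

```python
from collections import defaultdict

class CATHistogram:
    """Collect reuse distances measured in Cache Allocation Ticks:
    the global CAT clock advances once per cache replacement,
    not once per access."""
    def __init__(self):
        self.cat = 0               # global CAT clock
        self.last_access = {}      # cache line -> CAT time of last access
        self.h = defaultdict(int)  # h(i): CAT reuse-distance histogram

    def on_replacement(self):      # invoked on every cache replacement
        self.cat += 1

    def on_access(self, line):     # invoked on every access to the cache
        if line in self.last_access:
            self.h[self.cat - self.last_access[line]] += 1
        self.last_access[line] = self.cat
```

In practice the same sparse sampling as before would be applied on top of this, but the unit of distance is now replacements rather than accesses.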

Basic Probability Functions. The centerpiece of the StatShare model is the pair of functions f and f-bar. These functions give the probability of a miss (f) or a hit (f-bar) for an item in the cache with a given reuse distance. The f-functions, coupled with the reuse-distance histograms of the threads, produce the rest of the information of our statistical model. The f-functions are specific to a replacement policy. As we have mentioned, for the didactic purposes of this section we will assume a fully-associative (FA), random replacement cache where the notion of time is given by a global CAT counter.

[Figure 3: f and f-bar = 1 - f for random replacement in a FA cache of L = 4096 lines, as a function of reuse distance in CAT.]

Under this scenario, any item in such a cache of size L (in cachelines) has a 1/L probability of being replaced at any miss, or a (1 - 1/L) probability of remaining in the cache. If an item has a CAT reuse distance of i, then after i misses (or replacements) it has a probability of (1 - 1/L)^i of remaining in the cache and a probability of 1 - (1 - 1/L)^i of having been replaced. We call the miss probability function f, in contrast to the hit probability denoted as f-bar:

    f(i) = 1 - (1 - 1/L)^i        f-bar(i) = 1 - f(i) = (1 - 1/L)^i

Once we have a CAT reuse-distance histogram for a thread, it is easy to calculate its hits and misses by multiplying it with the f-bar and f functions respectively:

    hits = sum over i of h(i) * f-bar(i)        misses = sum over i of h(i) * f(i)

The results of these formulas agree with our simulation results with very high accuracy. However, in order to get an accurate count of misses we must take into account cold misses. Cold misses are estimated when we collect samples for the reuse-distance histograms of a thread: in short, dangling samples with no observed reuse distance correspond to cold misses [5].
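For concreteness, a direct transcription of these formulas (a minimal sketch; the histogram dict and cold-miss count would come from sampling as described earlier):

```python
def f(i, L):
    """Miss probability for an item with CAT reuse distance i in a
    fully-associative, random-replacement cache of L lines."""
    return 1.0 - (1.0 - 1.0 / L) ** i

def f_bar(i, L):
    """Hit probability: the complement of f."""
    return (1.0 - 1.0 / L) ** i

def expected_hits_misses(h, L, cold_misses=0):
    """Fold a CAT reuse-distance histogram h (dict: distance -> count)
    with the f-functions to estimate hits and misses."""
    hits = sum(count * f_bar(i, L) for i, count in h.items())
    misses = sum(count * f(i, L) for i, count in h.items()) + cold_misses
    return hits, misses
```

As a sanity check, with L = 4096 an item with reuse distance exactly L has f_bar(L, L) = (1 - 1/4096)^4096, approximately 1/e or 0.37: under random replacement it still has about a one-in-three chance of being a hit, in contrast to LRU, where such an item is a certain miss.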

4 Integrating Decay and LRU Replacement in the Model

The StatShare model gives us all the necessary theoretical information to decide which application we can compress to release space for the benefit of the system as a whole. It is a good approximation for relatively large caches (greater than 64KB) of moderate to high associativity (greater than 2) with LRU replacement, such as the likely L2 or L3 caches in CMPs [8][9][10]. In this section we describe in abstract terms the StatShare model for LRU replacement and decay. We will not expand into details but give the basic information needed to support our decay-based management policies. In addition, as we will show in the rest of this section, using local CAT counters in combination with a decay-driven replacement algorithm, we can precisely control a thread's cache footprint. This characteristic allows us not only to prevent malicious or cache-greedy applications from abusing the shared cache, but also to enforce high-level policies, such as policies that assign custom cache-space allocations based on external QoS needs.

4.1 Per-Thread Histograms

Since in this paper we are interested in identifying individual cache-greedy applications in a shared cache, we use per-thread CAT clocks that are advanced by cache replacements of cachelines belonging to a specific thread, regardless of the thread that causes the replacement. In this way, the CAT clock is insensitive to the status of the whole shared cache, but dedicated to the status (cache requirements) of each individual thread. Collecting histograms of each thread using their own CAT counters creates pure histograms which accurately describe the cache behavior of the thread confined to its space in the cache. This means that the term L, which is the cache size in cachelines under the global CAT counter, is now replaced by the active ratio (in cachelines) of each thread.
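A sketch of this per-thread variant follows, extending the global-clock sketch from Section 3; the thread-id bookkeeping and names are our illustrative assumptions.

```python
from collections import defaultdict

class PerThreadCAT:
    """Local CAT clocks: thread t's clock ticks whenever one of
    thread t's cachelines is replaced, no matter which thread
    caused the replacement."""
    def __init__(self):
        self.cat = defaultdict(int)   # thread -> local CAT clock
        self.last = {}                # (thread, line) -> local CAT time
        self.h = defaultdict(lambda: defaultdict(int))  # thread -> histogram

    def on_replacement(self, victim_owner):
        self.cat[victim_owner] += 1   # charge the owner of the evicted line

    def on_access(self, thread, line):
        key = (thread, line)
        if key in self.last:
            self.h[thread][self.cat[thread] - self.last[key]] += 1
        self.last[key] = self.cat[thread]
```

When the model is evaluated with such a histogram, L is the thread's active ratio in cachelines rather than the total cache size.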

LRU Replacement. With LRU replacement in a FA cache, the probabilities of a miss or a hit change with respect to those of random replacement. In short, the LRU f-functions are much steeper than the random f-functions and reach their bounds right at L. This is evident, for example, for the f function, which reaches 1 exactly at L, since nothing can remain in an LRU FA cache after seeing L replacements. However, the shape of the f-functions before L is complex to derive. Because LRU, unlike random, is not memoryless, the miss and hit probabilities depend on the state of the cache, which in turn implies that the f and f-bar functions depend on the reuse-distance histograms of the threads. In other words, the behavior of LRU depends on the applications. Assume that we have an application with a miss rate of 100%, i.e., it has no hits. The f-bar (hit probability) function in this case is a step function: everything with a reuse distance larger than L is guaranteed to be a miss, since an item that lives through L replacements is guaranteed to be thrown out of the cache. (However, the only histograms compatible with this f-bar function have no samples inside L; otherwise they would have hits.)

Now assume that we introduce hits into the cache by having some histogram samples inside L. Hits affect how quickly an item with a given reuse distance moves down the LRU chain. Consider, for example, an item at position x in the LRU chain (items enter at position 1 and fall out of the cache at position L+1). This item is pushed down the LRU chain either by new items that enter at the top via replacements, or by hits on older items, located below the item in question, which bring them to the top of the LRU chain. The number of possible hits on items located after x is a function of the application's reuse-distance histogram. The end result is that the more hits we have, the faster an item with a reuse distance less than L can be evicted, increasing the probability of misses at small reuse distances. Consequently, the f-functions are very steep around L (or the equivalent active ratio), and their form at reuse distances less than L depends on the hit ratio and the thread's actual reuse-distance histogram.

Decayed f and f-bar Functions. Decay modifies the f-functions of the decayed applications. Once we apply decay to one of the threads that share the cache, the underlying replacement policy of the cache (LRU or random) is changed, since decayed cachelines take precedence for eviction. The effect of decay on the LRU f-functions is to effectively make them step functions: the f-bar function is 1 almost up to the decay interval D and then rapidly falls to 0. The explanation is the following: if we decay a thread at a reuse distance D, all its items with smaller reuse distances can be hits as long as there are decayed lines available for replacements. Our modified replacement algorithm chooses a decayed line to replace if one is available. In addition, decayed items are certain misses, since they decay and are replaced. However, for performance reasons we allow hits on decayed items. This results in a discrepancy between our model and our implementation, since the decayed f-functions are step functions only if decayed items are misses. Thus our models are pessimistic in their assessment of performance.
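In code form, the decayed approximation is simply a step function. A minimal sketch, under the model's pessimistic assumption that decayed items always miss:

```python
def f_bar_decayed(i, D):
    """Decayed-LRU hit probability, approximated as a step at the
    decay interval D (in CAT): items reused within D can hit; items
    that survive past D are decayed and counted as misses."""
    return 1.0 if i < D else 0.0

def expected_hits_decayed(h, D):
    """Estimate the hits a thread retains if it is decayed with
    interval D, given its CAT reuse-distance histogram h."""
    return sum(count for i, count in h.items() if i < D)
```

Sweeping D over the candidate L-fractions and comparing expected_hits_decayed against the undecayed estimate gives exactly the "negligible damage" test used by the management policy described next.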

Cache Management. StatShare gives us all the elements required to make informed decisions and construct high-level cache management policies. Using the StatShare outputs, we are able to understand which thread (malicious or not) can be compressed into less space with negligible damage, and to drive the underlying replacement policy of the cache accordingly by selecting the appropriate decay intervals. This allows us not only to prevent malicious or cache-greedy applications from abusing the shared cache, but also to enforce high-level policies. The management policy we examine in this paper is as follows:

- We collect reuse-distance histograms using local (per-thread) CAT counters.
- We assess the threat that each thread poses based on its reuse-distance histogram.
- Threads are sorted according to their DoS threat level.
- We assess the performance impact of decaying the most threatening threads using the decayed LRU f-functions, and we choose an appropriate decay interval for each.
- Decay intervals are restricted to a small set of L-fractions (e.g., L, L/2, L/4, etc.).

Finally, we propose the operating system, and in particular the thread scheduler, as the appropriate place for using StatShare. This is because a sampling period is required, at the end of which a management decision can be made. Managing the cache must be performed periodically, since threads change behavior in different program phases. In addition, threads are created, suspended, or killed dynamically, and each change requires a new management decision. The sampling period must be long enough to allow useful histograms to be collected for the threads; in our evaluation the sampling window is 45M instructions. Finally, Quality-of-Service guarantees that must be taken into account can easily be handled at the OS level. For example, if it is externally desired to give specific space to specific threads, this can be taken into account in the scheduler by adjusting decay intervals to satisfy such requirements.

5 Practical Implementations

In this section we show that the abstract theory can be translated into realistic run-time implementations.

Reuse-Distance Histogram Collection. At first sight, the nature of the reuse-distance histograms, which potentially span values from 0 to infinity, seems impractical for run-time collection. There are two techniques that make histogram collection not only practical but even efficient: sampling and quantization. Sampling is a technique that was also used in StatCache [6][5]. Instead of collecting reuse distances for all accesses, we select a few accesses at random, and only trace those for their reuse distance. The resulting histogram is a scaled version of the original but with the exact same statistical properties. Sampling allows for efficient run-time tracing. In our evaluation our sampling ratio is 1:1024, i.e., we select at random one out of 1024 accesses. The second fundamental technique that allows a practical implementation of StatShare is the quantization of the reuse-distance histogram, as sketched below. Normally, it would be impractical to collect and store a histogram with potentially many thousands of buckets. However, samples with small reuse distances are statistically more significant than those with very large reuse distances, so we quantize the histogram into a small number of buckets with finer resolution at small reuse distances. In this way, the histograms can be collected in a small set of 32-bit registers per thread, which are updated by hardware and are visible to the OS, similarly to other model-specific registers such as performance counters. We have verified that the outputs of StatShare are practically indistinguishable using either quantized or full histograms.
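A possible quantization scheme, shown purely as a sketch: the power-of-two bucket boundaries and the bucket count of 20 are our assumptions, not the paper's exact parameters.

```python
NUM_BUCKETS = 20  # assumed; covers distances up to about 2^19 CAT

def bucket(distance):
    """Map a CAT reuse distance to a histogram bucket with finer
    resolution at small distances: bucket 0 holds distance 0,
    bucket b >= 1 holds distances in [2^(b-1), 2^b)."""
    b = 0
    while distance >= (1 << b) and b < NUM_BUCKETS - 1:
        b += 1
    return b

# Per-thread quantized histogram: a fixed set of counters standing in
# for the hardware registers exposed to the OS.
counters = [0] * NUM_BUCKETS

def record_sample(distance):
    counters[bucket(distance)] += 1
```

Because the f-functions change slowly at large reuse distances, evaluating them at one representative distance per bucket loses almost nothing, which is why the quantized and full histograms produce practically indistinguishable outputs.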

Decay Implementations and Replacement Policies. Our modified replacement algorithm is very simple: we replace a decayed cacheline (chosen at random) if there is one in the set; if there is not, we use the underlying LRU replacement policy. In order to hold the decay information, we use a set of registers (visible to the OS) that store the decay intervals of each thread. Non-decayed threads have an infinite decay interval, corresponding to the largest value of these registers. Cachelines are tagged with the CAT clock, which is updated every time a hit or a replacement occurs on the corresponding cacheline. CAT tags can be made just a few bits long [4]. At the time of replacement, the CAT tag of each cacheline is subtracted from the thread's CAT clock. If the result is greater than the decay interval of the corresponding thread, the cacheline is decayed and can be chosen for replacement. This check starts at a random place in the set and proceeds until either a decayed line is found or the entire set has been checked. In our methodology, the only decision we make is which decay intervals to use for the various threads.
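The victim-selection logic can be summarized in a few lines; this is a behavioral sketch of the mechanism just described, with the line and tag data structures being our own simplification of the hardware.

```python
import random

def choose_victim(cache_set, thread_cat, decay_interval, lru_victim):
    """Pick a replacement victim: the first decayed line found,
    starting from a random position in the set; otherwise fall
    back to the underlying LRU victim.

    cache_set:      list of lines, each with .owner and .cat_tag
    thread_cat:     dict thread -> current local CAT clock
    decay_interval: dict thread -> decay interval (float('inf') if
                    the thread is not decayed)
    lru_victim:     the line LRU would evict
    """
    start = random.randrange(len(cache_set))
    for k in range(len(cache_set)):
        line = cache_set[(start + k) % len(cache_set)]
        age = thread_cat[line.owner] - line.cat_tag
        if age > decay_interval[line.owner]:
            return line          # a decayed line takes precedence
    return lru_victim            # no decayed line in the set
```

Note that decayed lines are only preferred as victims; they are not invalidated, so hits on decayed lines remain possible, as the model discussion in Section 4 assumes.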

6 Evaluation

For our simulations we have modified an SMT simulator [12] to model a CMP architecture with 2 to 4 cores. Each core is a modest 2-way out-of-order superscalar. The memory hierarchy consists of private L1 instruction and data caches and a shared, 8-way set-associative, 64B-line L2 cache. The memory latency is 25 cycles. Our intention is to isolate the data access behavior of the applications; hence we use a relatively large instruction L1 to preclude instruction misses from polluting the L2.

We use a subset of the most memory intensive SPEC2000 benchmarks for our evaluation: art, gzip, equake, mcf, parser, and vpr. To emulate the impact of a malicious thread, we wrote our own malicious program, named tramp, which is designed to be a greedy consumer of the L2 (a sketch appears after the two-thread results below). The tramp program continuously scans a very large memory array (bigger than the L2), accessing one byte out of every 64 bytes (the L2 block size). In every iteration, a read and a write operation are performed. In this way, the best-case miss ratio of the tramp program is 50%.

To understand the behavior of decay in relation to StatShare's outputs, we have simulated sets of co-scheduled applications where one or two of them are decayed. The workloads consist of 2 and 4 threads. In some sets, the tramp program has the role of the greedy application, while in others the same role is taken by the two most memory intensive benchmarks of the SPEC2000 suite, art and mcf. Although our methodology allows any decay interval to be chosen in order to manage a thread, we have constrained the choice of decay intervals to binary fractions of the corresponding cache size. All our simulations run 200M instructions per thread. We simulate after skipping 1B instructions for art and gzip, 2B for mcf, parser, and vpr, and 3B for equake. After the skip, we warm up the caches for 5M instructions. Management decisions are taken every 45M instructions. In the rest of this section we discuss results for five representative cases.

tramp gzip. In this example tramp shares the cache with gzip. Figure 4 shows the active ratios and the miss ratios of the two threads for the four caches we consider, and for four decay intervals (decay is applied to tramp). Every set of bars corresponds to a specific cache size (noted on top of the set). The x-axis shows the decay intervals. The first bar of each set stands for an infinite decay interval (no decay), while the values 1, 2, and 4 correspond to L, L/2, and L/4 respectively (L is the cache size measured in cachelines).

[Figure 4: tramp vs. gzip: active ratios and normalized miss ratios for 64K, 256K, 512K, and 1M caches and various decay intervals.]

As we can see from Figure 4, our methodology successfully manages to divide the cache equally between the two threads. With an L/4 decay interval, both applications hold almost 50% of the cache at all cache sizes. The value of our cache management technique can be seen not only in the active ratios, but in the miss ratios too (miss ratios are normalized to the non-decayed case). tramp is already experiencing a high miss ratio, so compressing it does not significantly impact its performance (as can be seen from the graph). On the other hand, gzip is the kind of application (as shown by the StatCache curves of Figure 1) that can benefit from expanding its space and reducing its miss ratio: the more space it gets, the more hits it generates. In the 64K cache, gzip starts (in the non-decayed state) with an 87% miss ratio and ends up (at the L/4 decay interval) with a 62% miss ratio, a normalized reduction of almost 30%. In the 1MB case, the benefit is more pronounced: gzip starts with a 20% miss ratio and ends up with a miss ratio of less than 1%, a normalized reduction of 96%. In all cases, the miss ratio of tramp remains constant at 50%.

tramp equake. In this case we examine tramp with another SPEC2000 program, equake. Figure 5 shows the active ratios and the miss ratios for the four cache sizes and for the four decay intervals (infinite, L, L/2, L/4). As Figure 5 indicates, tramp begins by clearly hogging the cache, holding more than 90% of it in the non-decayed state (the same as in the previous example). Once it is decayed, it releases space for the benefit of equake. However, in contrast to gzip, equake cannot exploit its increased space except in the case of the 64K cache. This is also evident from the StatCache curves: giving more space to equake produces very few additional hits.

[Figure 5: tramp vs. equake: active ratios and normalized miss ratios for various decay intervals.]

mcf parser. Our third example is mcf co-scheduled with parser. mcf is one of the two most memory intensive programs of the SPEC2000 suite (the other is art). mcf is chosen for decay since it decays better than parser and occupies the most space in the cache (Figure 6). mcf's decay benefits parser with a maximum miss-ratio reduction of 23.5% for the 64K, 47% for the 256K, 34% for the 512K, and 25% for the 1M cache. mcf experiences a slight increase of 3% in its miss ratio only in the case of the 1M cache.

[Figure 6: mcf vs. parser: active ratios and normalized miss ratios for various decay intervals.]
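For reference, the tramp microbenchmark described at the beginning of this section amounts to little more than the following. This is our reconstruction from the description; the array size and iteration count are placeholders.

```python
# Behavioral sketch of tramp: touch one byte per 64B cache block of an
# array larger than the L2, with a read and a write on every iteration.
BLOCK = 64                       # L2 line size in bytes
ARRAY_BYTES = 4 * 1024 * 1024    # placeholder: anything larger than the L2

data = bytearray(ARRAY_BYTES)

def tramp(iterations=10**6):
    i = 0
    while iterations:
        x = data[i]                # read: misses (the line was evicted)
        data[i] = (x + 1) & 0xFF   # write: hits (the line was just fetched)
        i = (i + BLOCK) % ARRAY_BYTES
        iterations -= 1
```

Each block is touched by a read that misses and a write that hits the just-fetched line, which is why tramp's best-case miss ratio is 50%.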

tramp gzip parser vpr. In this example, we evaluate our methodology when the L2 cache is shared among 4 threads: tramp, gzip, parser, and vpr. Figure 7 shows the active ratios and the miss ratios in this case. The interesting observation to be made from Figure 7 is that tramp must be decayed harder in order to see significant changes in its active ratio. Thus, we extend the decay intervals down to L/16 (our management algorithm always picks tramp as the decayed application). In the 2M case, with a decay interval of L/16, tramp's miss ratio is increased by 2%, while its cache footprint is decreased by a factor of 2.9 compared to the non-decayed case. The space released by tramp benefits the other three applications: gzip increases its space by 1.3x, parser by 1.7x, and vpr by 1.4x. These expansions lead to reduced miss ratios for gzip, parser (2%), and vpr (8%).

[Figure 7: tramp-gzip-parser-vpr: active ratios and normalized miss ratios for 256K, 512K, 1M, and 2M caches.]

tramp art equake gzip. Finally, we give a 4-thread example where decay is applied to two applications, tramp and art, since they both pose a significant threat of DoS and can be significantly compressed. This two-thread decay management decision works very well since, when only tramp is decayed, its released space is occupied directly by art. art's aggressive behavior does not let the other two threads benefit from tramp's compression. On the other hand, even though art increases its cache footprint, its miss ratio does not show considerable improvement. Figure 8 presents the active ratios and miss ratios for this example. The first bar of every set corresponds to the non-decayed case (none of the applications are decayed). In the rest of the bars, tramp has a constant decay interval equal to L/16, while art's decay intervals are shown on the x-axis (L, L/2, L/4, L/8, L/16).

As we can see from Figure 8, equake and gzip benefit from art's and tramp's compression. In the 1M cache, equake increases its space by 4x and gzip by 2.5x. However, equake, in contrast to gzip, cannot exploit its increased space, leading to a meagre 2% decrease (improvement) in its miss ratio, while gzip experiences an impressive 43% decrease. The results are analogous for the other cache sizes, with a big difference in art's behavior in the 2M cache. As we can see from Figure 1, art is no longer in its flat area there, so if we try to compress it we will destroy its performance, as is evident from Figure 8 (2M case). art is not a good candidate for decay in this case.

[Figure 8: tramp-art-equake-gzip: active ratios and normalized miss ratios for 256K, 512K, 1M, and 2M caches.]

7 Conclusions

In this paper, we demonstrate a new management methodology for shared caches in CMP systems that utilizes statistical run-time information about application behavior in order to deal with Denial-of-Service attacks. Our methodology does not need to distinguish between malicious programs and greedy but not-by-nature-malicious programs, since the two categories behave similarly in terms of reuse-distance histograms. This leads us to a more generalized approach, where dealing with DoS attacks is similar to enforcing QoS constraints or sharing the cache in a fair way. The proposed methodology is evaluated using a detailed CMP simulator running the most memory intensive SPEC2000 applications and a tramp program which is designed to be an excellent consumer of the shared cache. Our results indicate that our attack-resistant cache management methodology makes it possible to identify which application (malicious or not) can be compressed into less cache space with

negligible damage, and to modify the underlying replacement algorithm of the cache accordingly at run time using decay. Our results show significant benefits across the board with minimal damage for the managed threads.

References

[1] G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. International Symposium on High-Performance Computer Architecture (HPCA), 2002.
[2] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2004.
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. International Symposium on High-Performance Computer Architecture (HPCA), 2005.
[4] M. Karlsson and E. Hagersten. Timestamp-Based Selective Cache Allocation. In High Performance Memory Systems, edited by H. Hadimiouglu et al., Springer-Verlag, 2003.
[5] E. Berg, H. Zeffer, and E. Hagersten. A Statistical Multiprocessor Cache Model. International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
[6] E. Berg and E. Hagersten. Fast Data-Locality Profiling of Native Execution. ACM SIGMETRICS, 2005.
[7] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. International Symposium on Computer Architecture (ISCA), 2001.
[8] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded SPARC Processor. IEEE Micro, 2005.
[9] K. Krewell. Power5 Tops on Bandwidth. Microprocessor Report, 2003.
[10] K. Krewell. Double Your Opterons; Double Your Fun. Microprocessor Report, 2004.
[11] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition, 1996.
[12] R. Goncalves, E. Ayguade, M. Valero, and P. Navaux. A Simulator for SMT Architectures: Evaluating Instruction Cache Topologies. Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2000.
[13] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal, 1970.
[14] CNN. Immense Network Assault Takes Down Yahoo, 2000.
[15] Netscape. Leading Web Sites Under Attack, 2000.
[16] D. Grunwald and S. Ghiasi. Microarchitectural Denial of Service: Insuring Microarchitectural Fairness. International Symposium on Microarchitecture (MICRO-35), 2002.
[17] J. Hasan, A. Jalote, T. N. Vijaykumar, and C. E. Brodley. Heat Stroke: Power-Density-Based Denial of Service in SMT. International Symposium on High-Performance Computer Architecture (HPCA), 2005.
[18] Techtarget.com. Technology Terms: Denial of Service.
[19] P. Soderquist and M. Leeser. Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder. ACM Multimedia 97 - Electronic Proceedings, 1997.


Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

A Visualization System and Monitoring Tool to Measure Concurrency in MPICH Programs

A Visualization System and Monitoring Tool to Measure Concurrency in MPICH Programs A Visualization System and Monitoring Tool to Measure Concurrency in MPICH Programs Michael Scherger Department of Computer Science Texas Christian University Email: m.scherger@tcu.edu Zakir Hussain Syed

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.

More information

18-548/15-548 Associativity 9/16/98. 7 Associativity. 18-548/15-548 Memory System Architecture Philip Koopman September 16, 1998

18-548/15-548 Associativity 9/16/98. 7 Associativity. 18-548/15-548 Memory System Architecture Philip Koopman September 16, 1998 7 Associativity 18-548/15-548 Memory System Architecture Philip Koopman September 16, 1998 Required Reading: Cragon pg. 166-174 Assignments By next class read about data management policies: Cragon 2.2.4-2.2.6,

More information

Performance Monitoring of Parallel Scientific Applications

Performance Monitoring of Parallel Scientific Applications Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

How to Optimize 3D CMP Cache Hierarchy

How to Optimize 3D CMP Cache Hierarchy 3D Cache Hierarchy Optimization Leonid Yavits, Amir Morad, Ran Ginosar Department of Electrical Engineering Technion Israel Institute of Technology Haifa, Israel yavits@tx.technion.ac.il, amirm@tx.technion.ac.il,

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

SIDN Server Measurements

SIDN Server Measurements SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources

More information

An Event-Driven Multithreaded Dynamic Optimization Framework

An Event-Driven Multithreaded Dynamic Optimization Framework In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2005. An Event-Driven Multithreaded Dynamic Optimization Framework Weifeng Zhang Brad Calder

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Runtime Hardware Reconfiguration using Machine Learning

Runtime Hardware Reconfiguration using Machine Learning Runtime Hardware Reconfiguration using Machine Learning Tanmay Gangwani University of Illinois, Urbana-Champaign gangwan2@illinois.edu Abstract Tailoring the machine hardware to varying needs of the software

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information

Introduction to Microprocessors

Introduction to Microprocessors Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?

More information

PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE

PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE Sudha M 1, Harish G M 2, Nandan A 3, Usha J 4 1 Department of MCA, R V College of Engineering, Bangalore : 560059, India sudha.mooki@gmail.com 2 Department

More information

Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)

Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs) Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs) 1. Foreword Magento is a PHP/Zend application which intensively uses the CPU. Since version 1.1.6, each new version includes some

More information

An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors

An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors Ian K. T. Tan, Member, IACSIT, Chai Ian, and Poo Kuan Hoong Abstract With the proliferation of multi-core processors in servers,

More information

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

More information

Measuring the Performance of Prefetching Proxy Caches

Measuring the Performance of Prefetching Proxy Caches Measuring the Performance of Prefetching Proxy Caches Brian D. Davison davison@cs.rutgers.edu Department of Computer Science Rutgers, The State University of New Jersey The Problem Traffic Growth User

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi University of Utah & HP Labs 1 Large Caches Cache hierarchies

More information

Violin Memory 7300 Flash Storage Platform Supports Multiple Primary Storage Workloads

Violin Memory 7300 Flash Storage Platform Supports Multiple Primary Storage Workloads Violin Memory 7300 Flash Storage Platform Supports Multiple Primary Storage Workloads Web server, SQL Server OLTP, Exchange Jetstress, and SharePoint Workloads Can Run Simultaneously on One Violin Memory

More information

Speeding Up Cloud/Server Applications Using Flash Memory

Speeding Up Cloud/Server Applications Using Flash Memory Speeding Up Cloud/Server Applications Using Flash Memory Sudipta Sengupta Microsoft Research, Redmond, WA, USA Contains work that is joint with B. Debnath (Univ. of Minnesota) and J. Li (Microsoft Research,

More information

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age. Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement

More information

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

x64 Servers: Do you want 64 or 32 bit apps with that server?

x64 Servers: Do you want 64 or 32 bit apps with that server? TMurgent Technologies x64 Servers: Do you want 64 or 32 bit apps with that server? White Paper by Tim Mangan TMurgent Technologies February, 2006 Introduction New servers based on what is generally called

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden smakadir@csc.kth.se,

More information

SECURING APACHE : DOS & DDOS ATTACKS - I

SECURING APACHE : DOS & DDOS ATTACKS - I SECURING APACHE : DOS & DDOS ATTACKS - I In this part of the series, we focus on DoS/DDoS attacks, which have been among the major threats to Web servers since the beginning of the Web 2.0 era. Denial

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

MONITORING power consumption of a microprocessor

MONITORING power consumption of a microprocessor IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,

More information

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design.

Categories and Subject Descriptors C.1.1 [Processor Architecture]: Single Data Stream Architectures. General Terms Performance, Design. Enhancing Memory Level Parallelism via Recovery-Free Value Prediction Huiyang Zhou Thomas M. Conte Department of Electrical and Computer Engineering North Carolina State University 1-919-513-2014 {hzhou,

More information

Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison

Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison Tomi Janhunen 1, Ilkka Niemelä 1, Johannes Oetsch 2, Jörg Pührer 2, and Hans Tompits 2 1 Aalto University, Department

More information

FUSION iocontrol HYBRID STORAGE ARCHITECTURE 1 WWW.FUSIONIO.COM

FUSION iocontrol HYBRID STORAGE ARCHITECTURE 1 WWW.FUSIONIO.COM 1 WWW.FUSIONIO.COM FUSION iocontrol HYBRID STORAGE ARCHITECTURE Contents Contents... 2 1 The Storage I/O and Management Gap... 3 2 Closing the Gap with Fusion-io... 4 2.1 Flash storage, the Right Way...

More information

Real-Time Analysis of CDN in an Academic Institute: A Simulation Study

Real-Time Analysis of CDN in an Academic Institute: A Simulation Study Journal of Algorithms & Computational Technology Vol. 6 No. 3 483 Real-Time Analysis of CDN in an Academic Institute: A Simulation Study N. Ramachandran * and P. Sivaprakasam + *Indian Institute of Management

More information

Firewalls Overview and Best Practices. White Paper

Firewalls Overview and Best Practices. White Paper Firewalls Overview and Best Practices White Paper Copyright Decipher Information Systems, 2005. All rights reserved. The information in this publication is furnished for information use only, does not

More information

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique Jyoti Malhotra 1,Priya Ghyare 2 Associate Professor, Dept. of Information Technology, MIT College of

More information