Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support


Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, Norman P. Jouppi
Computer Science and Engineering, Pennsylvania State University; Exascale Computing Lab, Hewlett-Packard Labs

Abstract: System-in-Package (SiP) and 3D integration are promising technologies for bringing more memory onto a microprocessor package to mitigate the memory wall problem. In this paper, instead of using them to build caches, we study a heterogeneous main memory that uses both on- and off-package memories, providing fast, high-bandwidth on-package accesses together with expandable, low-cost commodity off-package memory capacity. We introduce another layer of address translation, coupled with an on-chip memory controller, that can dynamically migrate data between on-package and off-package memory either in hardware or with operating system assistance, depending on the migration granularity. Our experimental results demonstrate that such a design achieves, on average, 83% of the effectiveness of the ideal case in which all memory can be placed in high-speed on-package memory for our simulated benchmarks.¹

¹ X. Dong and Y. Xie were supported in part by NSF grants 72617, 93432, 95365, and SRC grants.

I. INTRODUCTION

Recent trends in multi-/many-core microprocessor design, with increasing numbers of cores, have accentuated the already daunting memory bandwidth problem. The traditional approach to mitigating the memory wall is to add more storage on-chip in the form of a last-level cache (LLC). For example, the IBM POWER7 microprocessor has a 32MB L3 cache built out of embedded DRAM (eDRAM) technology. The decrease in miss rate achieved by the extra cache capacity helps hide the latency gap between a processor and its memory. Emerging System-in-Package (SiP) and three-dimensional (3D) integration technologies [1], [2] enable designers to integrate gigabytes of memory into the microprocessor package. However, using these on-package memory resources as a last-level cache might not be the best solution. High-performance commodity DRAM dies such as GDDR are heavily optimized for cost and do not include specialized high-performance tag arrays that can automatically determine a cache hit/miss and forward the request to the corresponding data array. Since the size of a tag array can be a hundred megabytes or more for a multi-gigabyte cache, storing tags on the microprocessor die is not feasible, and the only alternative is to put the LLC tags in the on-package DRAM.²

² If there are 32 DRAM dies stacked in the package and cache tags are 6.7% of the data size, the tags would still require the equivalent of 2.1 DRAM chips of area. Furthermore, the multi-core CPU chip is unlikely to support DRAM as dense as commodity DRAM chips.

Moreover, it would dissipate too much power and waste too much bandwidth if all the ways of a highly associative on-package cache (e.g., 16-way) were speculatively read at the same time. Doing so would either increase the bandwidth and I/O requirements of the CPU chip by a factor of 16 and the DRAM dynamic power by a factor of 16, or, if the I/O count and DRAM power were kept constant, reduce the on-package memory bandwidth by a factor of 16. Instead, we implement a 15-way set-associative cache in the space of a 16-way set-associative data array, packing all the tags for a set into the 16th cache line of that set. We then access the tags first, and access the data only after a tag hit, when the data way location is known.
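To make the two-step access concrete, the sketch below shows one plausible way such a set could be organized and probed. This is our illustration rather than the paper's implementation: the dram_read() interface, the tag-line layout, and the field widths are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define WAYS_PER_SET 16          /* physical lines per DRAM row (one set)      */
#define DATA_WAYS    15          /* usable data ways; way 15 packs the tags    */
#define LINE_BYTES   64

/* Assumed helper: reads one cache-line-sized chunk from on-package DRAM. */
extern void dram_read(uint64_t set, unsigned way, uint8_t out[LINE_BYTES]);

typedef struct {
    uint32_t tag[DATA_WAYS];     /* truncated tags, 60 bytes                   */
    uint16_t valid_mask;         /* one valid bit per data way                 */
} tag_line_t;                    /* fits inside one 64-byte line               */

/* Returns true on a hit and fills 'data'.  A hit costs roughly two sequential
 * on-package DRAM accesses; a miss is decided after roughly one.              */
bool llc_lookup(uint64_t set, uint32_t tag, uint8_t data[LINE_BYTES])
{
    uint8_t raw[LINE_BYTES];
    tag_line_t tags;

    dram_read(set, DATA_WAYS, raw);          /* 1st access: the packed tag line */
    memcpy(&tags, raw, sizeof tags);

    for (unsigned w = 0; w < DATA_WAYS; w++) {
        if ((tags.valid_mask & (1u << w)) && tags.tag[w] == tag) {
            dram_read(set, w, data);         /* 2nd access: now the way is known */
            return true;
        }
    }
    return false;
}
```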
This two-step access makes the cache miss/hit determination time roughly equal to the on-package DRAM access time, and makes returning the data on a cache hit take approximately twice the time of a single on-package DRAM access. Note that even if custom cache DRAM chips were developed, two sequential DRAM accesses would still be required, for similar reasons, to return data on a cache hit. Consequently, in this paper, instead of utilizing these on-package memory resources to augment existing caches or deepen the cache hierarchy, we propose a heterogeneous main memory architecture that consists of both on-package memories and off-package Dual Inline Memory Modules (DIMMs). To manage such a space and move frequently accessed data to fast regions, we propose two integrated memory controller schemes: the first handles everything in hardware, and the second takes assistance from the operating system. The effectiveness of each scheme depends on the memory management granularity. Through our evaluation, we show that our low-overhead solution can reduce off-package memory access traffic by 83% on average.

II. HETEROGENEOUS MAIN MEMORY SPACE

In this work, we assume that in future microprocessor packages, DRAM dies are placed beside the microprocessor die using SiP, as shown in Fig. 1. As flip-chip SiP can provide die-to-die bandwidth of at least 2Tbps [3], the on-package high-performance DRAM chip is modified from existing commodity products to increase the number of banks and to further increase the signal I/O speed, taking advantage of the high-speed on-package interconnects. However, we do not assume a custom tag part, and we use only a single on-package DRAM design, in order to reduce design cost and maximize the volume of the on-package DRAM parts. Although we leave room for potential through-silicon-via-based (TSV-based) 3D stacking in the future, in this work we assume the on-package memory size is on the order of hundreds of megabytes, because the capabilities of delivering power into and dissipating heat out of the chip package are still limited.

Fig. 1. The conceptual view of the System-in-Package (SiP) solution with 1 microprocessor die and 9 DRAM dies connecting to off-package DIMMs (one on-package DRAM die is for ECC). The on-package DRAM chip is slightly different from commodity off-package DRAM in that it has a many-bank structure to reduce the memory access queuing delay. When the on-package DRAM dies are configured as an LLC, they function as a 15-way associative cache, since each DRAM row is partitioned into 1 tag line and 15 data lines.

However, the available on-package memory capacity is much larger than state-of-the-art LLC capacities. Because the on-package DRAM has fast access speed, it would be ideal to use it as the unified system main memory [4]-[7]. While some application domains do not require much main memory [4], the aggregate on-package memory, which is assumed to be one gigabyte in this work, is generally still not sufficient to hold the entire main memory space, which can be several gigabytes. For our system, in addition to the on-package DRAM, we model four DDR3 channels connected to traditional DRAM DIMMs. For the rest of the paper, we refer to these DIMMs as off-package DRAM.

A. On-Chip Memory Controller

It is not practical to implement the heterogeneous main memory space if the memory controller is off-chip and all memory accesses have to leave the microprocessor package first. However, with the help of an on-chip memory controller, it becomes simple to form a heterogeneous memory space with both on- and off-package memories. As shown in Fig. 1, the memory controller on the microprocessor die is connected to the off-package DIMMs with the conventional 64-bit DDRx bus and to the on-package memory with a customized memory bus. The MSBs of the physical memory address are used to decode the target location. For example, if 1GB of a 32-bit memory space is on-package, addresses with Addr[31..30]=00 are mapped to on-package memory, while addresses with Addr[31..30]={01,10,11} are mapped to off-package DIMMs.
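The following sketch illustrates this kind of MSB-based front-end routing. It is our illustration under the example above (a 4GB physical space with the lowest 1GB on-package); the type names and the exact bit fields for channel, rank, bank, row, and column are assumptions, not the paper's layout.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     on_package;        /* routed to the on- or the off-package region */
    unsigned channel, rank, bank;
    uint32_t row, column;
} mem_request_t;

/* Route a physical address before transaction scheduling: the two address MSBs
 * (Addr[31..30] for a 4GB space with 1GB on-package) select the region, then
 * the usual channel/rank/bank/row/column indices are extracted.  The bit
 * slicing below is illustrative; a real controller derives it from the actual
 * DIMM geometry.                                                              */
mem_request_t route_address(uint32_t addr)
{
    mem_request_t r;
    r.on_package = (addr >> 30) == 0;       /* Addr[31..30] == 00              */
    r.column     = (addr >> 3)  & 0x7;
    r.channel    = (addr >> 6)  & 0x3;      /* four DDR3 channels              */
    r.bank       = (addr >> 13) & 0x7;      /* 8 banks per off-package chip    */
    r.rank       = (addr >> 16) & 0x1;
    r.row        = addr >> 17;
    return r;   /* on- and off-package requests are then scheduled separately  */
}
```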
Several minor modifications are necessary so that the memory controller can map physical memory addresses to either on-package or off-package regions. Fig. 2 and Fig. 3 show the difference between a normal on-chip DRAM memory controller and our proposed heterogeneity-aware one.

Fig. 2. Illustration of a conventional on-chip DRAM memory controller.

Fig. 3. Illustration of the proposed heterogeneity-aware on-chip DRAM memory controller with optional migration controller.

The modifications are as follows. As shown in Fig. 2, the conventional memory controller first schedules memory transactions, combining and reordering them for performance, and then resolves the physical address into DRAM indices in terms of channel ID, rank ID, bank ID, row ID, and column ID. In our heterogeneity-aware memory controller, as shown in Fig. 3, the Address Translation stage is instead moved ahead, so that each memory access is first routed to either the on-package or the off-package region in addition to being resolved into DRAM indices such as channel ID, rank ID, etc. Transaction scheduling is then performed for the on-package and off-package regions separately, since the transaction-layer optimization for each region is independent of that for the other region. The removal of off-package electrical signaling for on-package memory is another minor change (see Fig. 3). The Migration Controller is the other key component that makes the memory controller in Fig. 3 smart. It can be either purely hardware-based or OS-assisted, depending on the migration granularity; its implementation and algorithm are discussed in detail in Section III. In general, the migration controller monitors recent memory access behavior, reconfigures the physical address routing, and sends out additional memory operations to swap data across the package boundary.

B. Performance Comparison to Larger LLCs

Rather than using the on-package memory resources as a part of the main memory space, an alternative is to simply use them to expand the LLC capacity or deepen the cache hierarchy. Looking at the well-known average access time approximation,

Average access time = Hit time + Miss rate × Miss penalty,

the conventional cache hierarchy design only works effectively when the difference between the hit time and the miss penalty is large. When the LLC latency approaches the off-package main memory latency (see Table II), the relatively small difference between hit time and miss penalty does not justify using the on-package memory as a cache. Furthermore, Fig. 4 shows that there is almost no cache miss rate benefit from enlarging the LLC capacity. While accessing the LLC and the main memory in parallel can help hide the long LLC access latency, there is not enough off-package bandwidth to access the off-package memory speculatively and simultaneously with every reference to an on-package cache memory. Furthermore, off-package references consume significantly more power and should generally be avoided when possible.

Fig. 4. The cache miss rate at different LLC capacities for the NPB workloads (BT.C, CG.C, DC.B, EP.C, FT.C, IS.C, LU.C, MG.C, SP.C, UA.C).

To validate the concept that it makes more sense to leverage the large on-package memory as a part of the main memory instead of as an LLC, we run simulations on Simics [8] and compare the performance of the different options using total IPC as the metric. For our quad-core system, the performance evaluation uses the 4-thread OpenMP versions of workloads from the NAS Parallel Benchmark (NPB) 3.3 suite. We use CLASS-C as the problem size so that the memory footprint of each workload is sufficiently large for our purpose. The memory footprints of all 10 workloads in the NPB 3.3 suite are listed in Table I. Note that CLASS-C for the workload DC is not available in the benchmark package, and we therefore use CLASS-B instead.

The simulation target is an Intel i7-like quad-core processor with private L1/L2 caches and a shared, inclusive 8MB, 16-way L3 cache. The latencies of the L1, L2, and L3 caches are computed using CACTI [9] with 45nm technology. The main memory latency is modeled according to the Micron DDR3 datasheet [10]. The average random access memory latency depends on the actual memory reference timing and sequence. Assuming the SiP solution shown in Fig. 1, we set the on-package memory capacity to 1GB. The on-package and off-package memory latencies are modeled as follows:

- Off-package latency is the sum of the DRAM core access latency, the memory reference queuing delay, the memory controller traversal delay (including the memory controller processing delay and the propagation delay between the CPU core and the memory controller), the package pin delay, and the PCB wiring delay.

- On-package latency is the sum of the DRAM core access latency, the memory controller traversal delay, the silicon interposer pin delay, and the intra-package wiring delay.

Note that the memory reference queuing delay is not included in the on-package latency model, since it is almost eliminated by the increase in the number of on-package DRAM chip banks and their higher I/O speeds (our later trace-based simulation shows that accessing the off-package 8-bank DRAM causes 17 cycles of queuing delay on average, while accessing the on-package 128-bank DRAM causes less than 3 cycles). In this Simics simulation, we simply model the DRAM core access latency and the queuing delay as fixed numbers, setting them to 6 cycles and 12 cycles, respectively.
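A small sketch of how these per-component delays compose into the round-trip latencies used above (our illustration; the struct fields mirror the entries in Table II, the each-way entries count once in each direction, and no concrete cycle counts are hard-coded here):

```c
/* Latency composition for one memory reference, following the breakdown above.
 * Each-way delays are paid twice (request and response); round-trip delays and
 * core/queuing delays are paid once.                                           */
typedef struct {
    unsigned dram_core, queuing, mc_processing;
    unsigned ctrl_to_core_each_way, pkg_pin_each_way, interposer_pin_each_way;
    unsigned pcb_wire_round_trip, intra_pkg_round_trip;
} delay_cycles_t;

unsigned off_package_latency(const delay_cycles_t *d)
{
    return d->dram_core + d->queuing + d->mc_processing
         + 2 * d->ctrl_to_core_each_way
         + 2 * d->pkg_pin_each_way
         + d->pcb_wire_round_trip;
}

unsigned on_package_latency(const delay_cycles_t *d)
{
    /* No queuing term: it is largely hidden by the 128-bank on-package DRAM. */
    return d->dram_core + d->mc_processing
         + 2 * d->ctrl_to_core_each_way
         + 2 * d->interposer_pin_each_way
         + d->intra_pkg_round_trip;
}
```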
In the trace-based simulation presented later in Section IV, a detailed DRAM timing model with the FR-FCFS scheduling policy [11] is used to obtain more accurate results.

TABLE I
THE MEMORY FOOTPRINTS OF THE NPB 3.3 BENCHMARK SUITE

  Workload   Memory     Workload   Memory
  BT.C       76MB       CG.C       92MB
  DC.B       5876MB     EP.C       16MB
  FT.C       5147MB     IS.C       164MB
  LU.C       615MB      MG.C       3426MB
  SP.C       758MB      UA.C       51MB

TABLE II
THE SIMULATION CONFIGURATION OF THE BASELINE PROCESSOR AND THE OPTIONAL ENHANCEMENTS WITH ON-PACKAGE DRAM MODULES

  Microprocessor
    Number of cores           4
    Frequency                 3.2GHz
  Cache/Memory Hierarchy
    DL1 and IL1 caches        32KB, 8-way, 2-cycle, private
    L2 cache                  256KB, 8-way, 5-cycle, private
    L3 cache                  8MB, 16-way, 25-cycle, shared
  Miscellaneous
    Memory controller         5-cycle for processing
    Controller-to-core delay  4-cycle each way
    Package pin delay         5-cycle each way
    PCB wire delay            11-cycle round-trip
    Interposer pin delay      3-cycle each way
    Inter-package delay       1-cycle round-trip
    DRAM core delay           5-cycle
    Queuing delay             116-cycle
  On-package LLC or memory
    L4 cache                  1GB, 15-way, hit 14-cycle, miss 7-cycle
    On-package memory         1GB, 7-cycle
    Off-package memory        2-cycle
  Simulation Cycles
    Fast-forward              pre-defined breakpoint
    Warm-up                   1 billion instructions
    Full simulation           1 billion cycles

The on-package memory is either employed as a shared L4 cache or used as the on-package region of a heterogeneous memory space. The detailed architecture configuration is listed in Table II. Note that when the on-package DRAM is used as an LLC, the cache hit time is twice the DRAM access latency, since the tag and the data are accessed sequentially. As illustrated in Fig. 5, the simulation results show that using the extra 1GB of on-package DRAM resources to add a new L4 cache can improve IPC, but in some cases (e.g., CG.C) the performance improvement is limited to 0.1%. This is because cache capacity misses are not the performance bottleneck beyond a certain capacity threshold, while the increased cache latency starts to offset the benefit and degrade performance, as illustrated in Fig. 4. On the other hand, directly mapping the on-package DRAM resources into the main memory space can often achieve better performance. As shown in Fig. 5, for the 7 of the 10 workloads whose memory footprints are less than 1GB, this strategy is equivalent to having all the memory on-package. For the other 3 workloads, MG.C has slightly better performance using the on-package memory as heterogeneous memory instead of as an L4 cache, while DC.B and FT.C cannot compete against the L4 cache. However, note that the simulation result shown in Fig. 5 is only for a static mapping. As we will show in the next section, a heterogeneous main memory with dynamic mapping that intelligently migrates data between on-package and off-package regions can further improve performance and approach the ideal performance, as described in Section III.

Fig. 5. The IPC comparison among different options for using on-package memory: (a) baseline; (b) deepening the cache hierarchy with a 1GB DRAM L4 cache; (c) statically mapping the first 1GB of memory to the on-package DRAM; (d) the ideal case (all DRAM resources on-package).

III. DATA MIGRATION USING ON-CHIP MEMORY CONTROLLER

As discussed in the previous section, static memory mapping that always keeps the lowest memory address space on-package only works effectively when the application memory footprint fits into the on-package memory. For workloads that need much more memory, the performance improvement achieved by static mapping is trivial. For example, the performance improvement of DC.B is only 16% and that of FT.C is only 2.7%; both are smaller than the improvements achieved by using the on-package DRAM as an LLC. The major reason is the lack of a dynamic data mapping capability. To solve this issue, we propose to add data migration functionality to the memory controller so that frequently used portions of memory reside on-package with higher probability. Compared to other work on data migration [12]-[16]: (1) our data migration is implemented by introducing another layer of address translation; (2) depending on the data granularity, we propose either a pure-hardware implementation or an OS-assisted implementation; and (3) a novel migration algorithm is used to hide the data migration overhead. In this paper, we use the term macro page for the unit of data migration; the macro page size can be much larger than the 4KB page size used in most operating systems.

A. Data Migration Algorithm

In this work, our data migration algorithm is based on a hottest-coldest swapping mechanism, which monitors the LRU (least recently used) on-package macro page (the coldest) and the MRU (most recently used) off-package macro page (the hottest) during the last period of execution, and triggers a migration after each monitoring epoch if the off-package MRU page has been accessed more frequently than the on-package LRU page. As demonstrated later in the simulation results, this is a simple but effective way to implement data migration. In this section, we describe the migration algorithm incrementally, starting from a relatively straightforward design.

Basic Design (N). This is the basic design, in which the on-chip memory controller monitors the LRU on-package macro page and the MRU off-package macro page, and maintains a translation table.

Fig. 6. The basic migration management design (N).
Each row represents one on-package memory slot, while the right column represents which macro page is currently located in the on-package memory region. Note that the right column of the translation table is initialized to contain the same value as its left column counterpart so that the lowest 1GB main memory is initially located on-package. In addition, this translation table is bi-directional. For input macro page IDs smaller than N, the translation table works as a RAM; for IDs bigger than N, it works as a CAM. We call this design N because all the N slots in the on-package memory region are utilized. The memory controller always monitors the LRU onpackage macro page (for example, the 2nd row in Fig. 6) and the MRU off-package macro page. During the data migration procedure, the translation table is updated and the right column of the 2nd row in Fig. 6 will store the MRU macro page ID. However, the problem of this basic implementation is that data stored in LRU and MRU pages have to be swapped before the translation table is updated. Since the macro page size is usually large and the off-package DDRx bandwidth is limited, it will halt the execution and incur unacceptable performance overhead. Therefore, techniques to hide data movement latency is necessary. N-1 As an improved design, the N-1, in which one onpackage memory slot is sacrificed, is proposed. As shown in Fig. 7, although there are still N slots, one of them (initially the last slot) does not map to any memory address and it is

Fig. 7. The improved migration management design, which sacrifices one on-package memory slot for pending memory transactions (N-1).

In practice, empty can be represented by a reserved macro page Ω (e.g., the highest 4MB macro page, with ID 0x800, in the 8GB memory space³). Therefore, the number of effective on-package memory slots is N-1. Furthermore, each row in the translation table has an additional P bit, called the pending bit, which is initialized to 0. When the P bit is set, the RAM function of the empty slot is bypassed and the left column is always translated to Ω instead, while the CAM function still works. The key characteristic of the translation table is that if macro page n (n < N) is located in the on-package region, it can only be in the n-th row.

³ This macro page can be reserved by the hardware driver after booting the OS.

Before we explain the swapping algorithm, we classify all macro pages into 5 categories:

1) Original Fast (OF): a macro page whose ID is less than N and whose data is located in the on-package memory region without any address translation. Its address is in the left column of the translation table and points to itself.
2) Original Slow (OS): a macro page whose ID is greater than N and whose data is located in the off-package memory region without any address translation. Its address is not in the translation table.
3) Migrated Fast (MF): a macro page whose ID is greater than N but whose data has been migrated into the on-package memory region. Its address is in the right column of the translation table.
4) Migrated Slow (MS): a macro page whose ID is less than N but whose data has been migrated out of the on-package memory region. Its address is in the left column of the translation table and points to a new address.
5) Ghost: a macro page whose ID is less than N but whose data has been migrated to a reserved macro page in the off-package memory region. Its address is in the left column of the translation table and points to the reserved macro page (i.e., 0x800).

The following describes how to perform a hottest-coldest swap.

If the macro page ID of the MRU is greater than N and that of the LRU is less than N, the MRU is an OS macro page and the LRU is an OF one. This is the simplest case. As shown in Fig. 8(a), the first step is to copy data C into the empty slot B. Since the MRU slot C is now in the on-package region, a new link, B-to-C, is updated in the translation table. However, because the link C-to-B is not ready yet, the P bit of this row is set to 1. The second step is to copy data B from the Ghost slot, Ω, to slot C. After this step finishes, the P bit is reset. Finally, after copying the LRU data, A, from slot A to slot Ω, slot A becomes the new empty slot.

If the macro page ID of the MRU is greater than N and that of the LRU is also greater than N, the MRU is an OS macro page and the LRU is an MF one. Fig. 8(b) shows the data movement in this case. The first two steps are the same as in Fig. 8(a). In the third step, data A, which is currently stored in slot C, is copied to slot Ω, and after that row A in the translation table is marked as pending. Finally, data C is moved back to its original place, and after that the P bit is reset.
During the last step, all memory accesses to data A are no longer routed to slot C, but accesses to data C are still routed to slot A, because the P bit only prevents the address translation from A to C.

If the macro page ID of the MRU is less than N and that of the LRU is also less than N, the MRU is an MS macro page and the LRU is an OF one. In this case, as illustrated in Fig. 8(c), the first step is to copy data D into the empty slot C and update the translation table by adding a new link C-to-D with the pending bit marked. The second step is to copy data B back to its original slot and set the corresponding entry in the translation table to B-to-B. The third step is the same as the second step in Fig. 8(a) and Fig. 8(b); after it, the P bit is cleared. The fourth step is the same as the third step in Fig. 8(a).

If the macro page ID of the MRU is less than N and that of the LRU is greater than N, the MRU is an MS macro page and the LRU is an MF one. This is the most complicated case because both the MRU and the LRU pages are migrated ones. Fig. 8(d) demonstrates the data movement in this case: the first 3 steps are the same as in Fig. 8(c), and the last 2 steps are the same as in Fig. 8(b).

Fig. 8. Migration algorithm: (a) data migration in the case that MRU ≥ N and LRU < N; (b) data migration in the case that MRU ≥ N and LRU ≥ N; (c) data migration in the case that MRU < N and LRU < N; (d) data migration in the case that MRU < N and LRU ≥ N.

Example. We use the case illustrated in Fig. 8(d) to describe how the coldest on-package data are swapped with the hottest off-package data. As shown in Fig. 8(d), before the swap is triggered, data A and B have already been swapped with D and E, respectively. Consequently, A and B are MS pages, while D and E are MF pages. In addition, macro page C is the Ghost page, since its data is actually stored in an off-package slot, called Ω. The MRU off-package page is B, and it should be moved back to its original location (slot B) in the on-package memory region. Conversely, the LRU on-package page is D, and it should be moved back to its original location (slot D) in the off-package memory region. The entire swap procedure is performed in the following order:

1) Data E stored in slot B is moved to slot C, which is currently empty.
2) Update the translation table: change Row C from C-to-empty to C-to-E, and set the P bit of Row C.
3) Copy data B back to slot B. After this step, accesses to the MRU macro page, B, are already routed to the on-package region.
4) Update the translation table: change Row B from B-to-E to B-to-B.
5) Copy data C from slot Ω to slot E.
6) Update the translation table: clear the P bit of Row C.
7) Copy data A from slot D to slot Ω.
8) Update the translation table: set the P bit of Row A.
9) Copy data D from slot A to its original place, slot D.
10) Update the translation table: change Row A from A-to-D to A-to-empty, and clear the P bit. Until this step is finished, accesses to data D are still routed to the on-package region.

In general, this algorithm makes sure that during the data migration procedure, the data under movement has two physical locations: one in the on-package memory region and the other in the off-package memory region. Thanks to this data duplication and the introduced P bit, which blocks the incomplete bi-directional mapping, program execution is never halted, since all memory accesses are routed to an available physical location. With this data movement algorithm, after the first step in Fig. 8(a) and Fig. 8(b) is completed, or the first two steps in Fig. 8(c) and Fig. 8(d) are completed, the MRU macro page, which was previously stored in the off-package memory region, starts to experience the fast access speed provided by the on-package memory. In addition, until the last step (in all four cases) is completed, the LRU macro page can still be accessed quickly, since it still has a copy in the on-package memory.

N-1 with Live Migration. Although N-1 hides the migration latency by conservatively accessing the MRU macro page at off-package memory speed during the migration, we can further improve the algorithm by introducing the concept of critical-data-first. In the pure N-1 design, it takes time to bring the MRU page on package. For example, when the macro page size is 4MB and the off-package interface is DDR3-1333, it takes 374µs to finish the first step of the N-1 algorithm. During this 374µs, all subsequent accesses to the MRU page are still routed to off-package memory and are slow. To address this issue, we divide the large data transaction into smaller chunks, such as 4KB. By doing so, the on-package memory can serve any request to the 4KB sub-blocks that have already been transferred while the rest of the swap is still ongoing in the background.

Fig. 9. The improved version of N-1 with live migration support: the additional F bit indicates that the corresponding on-package slot is under data movement; the associated bit map indicates which sub-blocks are ready for access.
To implement the critical-data-first concept in hardware, we add another bit to each row of the translation table, along with a separate bit map. As shown in Fig. 9, the newly added bit is called the F bit (Filling bit). When the F bit is set, the corresponding on-package slot is loading data from off-package memory and is only partially available. The additional bit map indicates which sub-blocks have already been moved to the on-package slot. When the F bit is set to 1, all the bits in the bit map are reset to 0; when all the bits in the bit map become 1, the F bit is reset to 0, indicating that the data loading is complete. For example, with a macro page size of 4MB and a sub-block size of 4KB, there are 1,024 bits in the bit map. To support critical-data-first, the memory controller starts copying the macro page from the position of the MRU sub-block and then wraps the address around to the beginning. Using this method, the migration overhead is further reduced. In Section IV, we compare the performance improvement achieved by the basic N design, the N-1 design, and this improved design, which we call Live Migration.

B. Data Migration Implementation

The migration algorithm can be implemented either in a pure-hardware way or in an OS-assisted way, depending on the migration granularity. Basically, the data migration functionality is achieved by keeping an extra layer of address translation that maps the physical address to the actual machine address. The pure-hardware scheme keeps the translation table in hardware, while the OS-assisted scheme keeps it in software.
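The sketch below pulls the pieces of the translation table together: the per-slot entry with its resident macro page ID, P bit, F bit, and sub-block bit map, and the bidirectional lookup (RAM behavior for IDs below N, CAM behavior for IDs at or above N). It is a simplified illustration of the mechanism described above, assuming 4MB macro pages, 1GB of on-package memory, and 4KB sub-blocks; the names and the fallback routing for not-yet-copied sub-blocks are our assumptions, not the controller's actual design.

```c
#include <stdint.h>
#include <stdbool.h>

#define N_SLOTS     256          /* 1GB on-package / 4MB macro pages            */
#define SUB_BLOCKS  1024         /* 4MB macro page / 4KB sub-blocks             */

extern uint32_t omega_page;      /* reserved off-package ghost macro page,
                                    set aside by the driver at boot time        */

typedef struct {
    uint32_t resident;               /* macro page currently held by this slot  */
    bool     pending;                /* P bit: forward (RAM) mapping not valid  */
    bool     filling;                /* F bit: slot being filled sub-block-wise */
    uint64_t filled[SUB_BLOCKS / 64];/* which 4KB sub-blocks have arrived       */
} tt_entry_t;

static tt_entry_t tt[N_SLOTS];       /* row i corresponds to on-package slot i  */

typedef struct { bool on_package; uint32_t id; } loc_t;  /* slot or macro page  */

static bool subblock_ready(const tt_entry_t *e, uint32_t offset)
{
    uint32_t sb = offset >> 12;                       /* 4KB sub-block index    */
    return !e->filling || ((e->filled[sb / 64] >> (sb % 64)) & 1);
}

/* Physical-to-machine translation for one access to 'page' at byte 'offset'.   */
loc_t translate(uint32_t page, uint32_t offset)
{
    if (page < N_SLOTS) {                             /* RAM direction          */
        const tt_entry_t *e = &tt[page];
        if (e->pending)                               /* bypass: route to Omega */
            return (loc_t){ false, omega_page };
        if (e->resident == page)                      /* OF page (or refilling) */
            return subblock_ready(e, offset) ? (loc_t){ true, page }
                                             : (loc_t){ false, page };
        return (loc_t){ false, e->resident };         /* MS page: swapped frame */
    }
    for (uint32_t s = 0; s < N_SLOTS; s++)            /* CAM direction          */
        if (tt[s].resident == page)                   /* MF page found on-package */
            return subblock_ready(&tt[s], offset) ? (loc_t){ true, s }
                                                  : (loc_t){ false, page };
    return (loc_t){ false, page };                    /* OS page: off-package   */
}
```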

The pure-hardware solution is preferred when the macro page size is relatively large, so that the number of macro pages is manageable within modest hardware resources. On the other hand, if a finer data migration granularity is required, the number of macro pages becomes too large for hardware to handle, and the OS-assisted scheme is used to track the access information of each macro page.

Pure-Hardware Solution. If the data migration is implemented purely in hardware, the major overhead comes from the translation table. When the on-package memory size is 1GB and the macro page size is 4MB, the right column takes 26 bits per entry, so the total number of bits per entry is 28, including the P bit and the F bit. The cost of the entire translation table with 256 entries is 7,168 bits, which is comparable to a TLB in conventional microprocessors. Since our proposed translation table has both RAM and CAM functions, we conservatively assume that it takes 2 additional clock cycles to complete one address translation. Besides the translation table, two bit maps also incur hardware overhead. The first bit map stores the filling status of each sub-block; in the 4MB macro page case, its size is 1,024 bits. The second bit map records the LRU macro page using a clock-based pseudo-LRU algorithm, as used in real microprocessor implementations [17], and its size is 256 bits. The MRU policy is approximately implemented by a multi-queue algorithm [18], for which we use three levels of queues with ten entries per level; thus, the size of the multi-queue structure is 780 bits. Hence, in total, the proposed pure-hardware scheme needs 9,228 bits to manage 1GB of on-package memory at the 4MB granularity, which is only a small overhead in today's microprocessor designs.

Fig. 10. The hardware overhead to manage 1GB of on-package memory at different data migration granularities.

Fig. 10 shows that the number of bits required by the pure-hardware solution increases rapidly as the macro page size is reduced. In this paper, we consider the pure-hardware solution feasible only for granularities larger than 1MB.

OS-Assisted Implementation. When finer-granularity data migration is used, the physical-to-machine address translation has to be managed by software, since the pure-hardware scheme has too much overhead, as shown in Fig. 10. In the OS-assisted scheme, the data structures involved in the aforementioned algorithm are kept by the operating system, while the counters (i.e., the pseudo-LRU counter used to determine the on-package LRU macro page and the multi-queue counter used to determine the off-package MRU macro page) are still implemented in hardware. The translation table is updated by a periodic OS routine, and the OS issues data swapping instructions after each update.
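The hardware cost accounting above can be reproduced with a few lines of arithmetic. The sketch below is our reading of that accounting (the per-slot pseudo-LRU bit, the fixed 780-bit multi-queue, and the 48-bit address width are taken from the 4MB example and held constant across page sizes); it prints the per-granularity bit counts that Fig. 10 plots.

```c
#include <stdio.h>
#include <stdint.h>

/* Storage cost (in bits) of the pure-hardware migration structures for 1GB of
 * on-package memory at a given macro page size.                               */
static uint64_t migration_hw_bits(uint64_t page_bytes)
{
    const uint64_t on_pkg_bytes = 1ull << 30;             /* 1GB on-package    */
    const unsigned addr_bits    = 48;

    uint64_t entries   = on_pkg_bytes / page_bytes;       /* table rows, N      */
    unsigned offset_b  = __builtin_ctzll(page_bytes);     /* page offset bits   */
    uint64_t per_entry = (addr_bits - offset_b) + 2;      /* page ID + P + F    */
    uint64_t fill_map  = page_bytes / 4096;               /* 4KB sub-block map  */
    uint64_t plru_map  = entries;                         /* 1 bit per slot     */
    uint64_t multi_q   = 780;                             /* 3 levels x 10 entries */

    return entries * per_entry + fill_map + plru_map + multi_q;
}

int main(void)
{
    /* 4MB pages: 256*28 + 1024 + 256 + 780 = 9,228 bits, matching the text.   */
    for (uint64_t sz = 4096; sz <= (4ull << 20); sz <<= 2)
        printf("%4llu KB macro pages -> %llu bits\n",
               (unsigned long long)(sz >> 10),
               (unsigned long long)migration_hw_bits(sz));
    return 0;
}
```

For 4KB macro pages this yields roughly 10 million bits, which is why the pure-hardware scheme is restricted to coarse granularities and the OS-assisted scheme takes over below 1MB.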
Besides the data movement latency, the extra overhead caused by OS involvement is on the same order as the conventional TLB update overhead, which is mainly consumed by user-kernel mode switching and is about 127 cycles [19].

IV. TRACE-BASED ANALYSIS ON DYNAMIC MIGRATION

To evaluate the effectiveness of our proposed hottest-coldest swap, we compare the performance of the N, N-1, and N-1 with Live Migration designs. We evaluate the different methods in a conventional system primarily using trace-based simulation. In this section, we use traces rather than the detailed full-system simulator used in Section II, because trace-based simulation makes it practical to process trillions of main memory accesses. We collected the memory traces from a detailed full-system simulator [20]; the trace file records the physical address, CPU ID, time stamp, and read/write status of all main memory accesses.

The workloads used to evaluate our designs include FT.C and MG.C, which have large memory footprints as shown earlier. Additionally, we developed a multi-programmed workload, SPEC2006 Mixture, by combining the traces of gcc, mcf, perl, and zeusmp. Besides computation-intensive workloads, we also consider transaction-intensive workloads; we therefore add traditional server benchmarks (pgbench and SPECjbb2005) and a mixture of Web 2.0-based benchmarks (indexer). All of these workloads have memory footprints larger than 2GB. The simulation parameters and workload descriptions are summarized in Table III.

TABLE III
SIMULATION PARAMETERS AND WORKLOAD/TRACE DESCRIPTIONS

  Memory system parameters
    Total memory capacity        4GB
    Macro page size              from 4KB to 4MB
    On-package memory capacity   512MB
    Sub-block size               4KB
  Workloads
    FT.C              FT contains the computational kernel of a 3D FFT-based spectral method.
    MG.C              MG uses a V-cycle MultiGrid method to compute the solution of a 3D scalar Poisson equation.
    SPEC2006 Mixture  The combination of four SPEC2006 workloads: gcc, mcf, perl, and zeusmp.
    pgbench           TPC-B-like benchmark running PostgreSQL 8.3 with pgbench and a scaling factor of 1.
    Indexer           Nutch 0.9.1 indexer, Sun JDK 1.6.0, and HDFS hosted on one hard drive.
    SPECjbb           4 copies of SPECjbb2005, each with 16 warehouses, using Sun JDK.

In order to demonstrate how our proposed memory controller effectively leverages the on-package memory region, we limit the capacity of the on-package memory space to 512MB. For the other parameters, we keep most of the assumptions used in the previous sections. However, in this trace-based simulation, we model the detailed DRAM access latency by assuming the FR-FCFS scheduling policy [11] and open-page access.

We use an 8-bank structure for the off-package DRAM and a 128-bank structure for the on-package DRAM. Therefore, the simulated DRAM access latency depends on the memory reference sequence, the read/write distribution, the row/bank locality, and whether the access is routed to the on-package or the off-package region. In addition, the macro page granularity under memory controller management ranges from 4KB to 4MB, and the sub-block size for live migration is 4KB. Due to the hardware overhead concern, the OS-assisted scheme is used for macro pages smaller than 1MB, and the pure-hardware scheme is used for macro pages of 1MB and larger.

The swap interval is another parameter that has a large impact on the effectiveness of our proposed hardware-based data migration scheme. Therefore, we run the trace-based simulation with varying swapping intervals, comparing the average random memory access latency when the swap operation is triggered after every 1K, 10K, and 100K memory accesses, respectively. When the data swap activity becomes too frequent, the P bit and the F bit prevent triggering another swap while the previous swap is not yet complete.

Fig. 11. The average memory access latency of various workloads using different methods of data swapping: (a) swapping interval = 1K memory accesses; (b) swapping interval = 10K memory accesses; (c) swapping interval = 100K memory accesses.

A. N, N-1, and Live Migration

Fig. 11 shows the effective memory access latencies for the different swapping intervals and compares the different migration algorithms. Observing the simulation results when the data granularity is large (i.e., 4MB), it is obvious that the straightforward method, N, is not practical, because it needs to move a large chunk of data without hiding any latency.

Hence, in this design, when the swap frequency is high (i.e., once per 1K or 10K memory accesses), the migration overhead offsets the benefit. Therefore, in order to hide the migration overhead, N-1, which sacrifices one slot in the translation table, should be used to overlap data migration with computation. In addition, live migration, which cuts the macro page into smaller pieces, further hides the migration overhead through finer-granularity overlapping and reduces the average memory access latency by 5.2%. However, when the data migration granularity is small (i.e., 4KB), the difference among N, N-1, and live migration becomes negligible. This is because the overhead of swapping two small macro pages is trivial, and N has one more slot in the translation table, as its name implies.

B. Migration Granularity and Frequency

Fig. 12. The average memory latency (swapping interval = 1K memory accesses).

Fig. 13. The average memory latency (swapping interval = 10K memory accesses).

Fig. 14. The average memory latency (swapping interval = 100K memory accesses).

The migration granularity and the migration frequency are the two key factors affecting the effectiveness of the proposed heterogeneous main memory space. To study the impact of these two factors, we use the live migration algorithm with different macro page sizes and swapping intervals; Fig. 12 to Fig. 14 illustrate the results. The comparison shows that the migration frequency is the more important factor, since the minimum memory latencies in Fig. 12 are smaller than those in Fig. 13 and Fig. 14. Another observation is that different types of workloads favor different migration granularities, and the optimal migration granularity also depends on the migration frequency. While live migration makes it feasible to migrate data across the package boundary without incurring too much overhead, it is necessary for the memory controller to adaptively change the migration granularity according to the type of workload. In this work, we propose to use the OS-assisted approach for fine granularities and the pure-hardware approach for coarse granularities, as described in the previous sections.

In order to demonstrate the effectiveness of the proposed heterogeneous memory space and its management algorithms in a more general way, we define the effectiveness η as

η = (Latency w/o migration - Latency w/ migration) / (Latency w/o migration - DRAM core latency) × 100%.

This metric approximately reflects how many memory accesses are routed to the on-package memory region.

TABLE IV
THE EFFECTIVENESS OF THE PROPOSED MEMORY CONTROLLER-BASED DATA MIGRATION IN REDUCING THE AVERAGE MEMORY ACCESS LATENCY

  Workload        FT.C    MG.C    pgbench   indexer   SPECjbb   SPEC2006 Mixture
  Effectiveness   69.1%   84.3%   92.2%     86.1%     72.2%     99.1%

Note that the average effectiveness is 83%, as listed in Table IV, while the on-package memory size is 512MB, which is only 12.5% of the 4GB memory space in this case.

C. Sensitivity Analysis on On-Package Memory Capacity

In order to further demonstrate how our proposed heterogeneous main memory with the migration feature enabled can efficiently leverage the on-package memory resources, we conduct a sensitivity analysis by reducing the on-package memory capacity from 512MB to 128MB. As expected, the average memory access latency increases, because it becomes harder to keep as much data on package when the on-package memory size is reduced. However, as illustrated in Fig. 15,
the average latency is still much shorter than the latency without dynamic data migration, even when the on-package memory capacity is reduced from 512MB to 128MB.

Fig. 15. The average memory access latency of the heterogeneous main memory under different sizes of on-package memory.

D. Power Evaluation

Fig. 16 shows the total memory power comparison between using hybrid on- and off-package DRAM with dynamic migration and using only off-package DRAM. We assume 5pJ/bit for both on- and off-package DRAM core accesses, 1.66pJ/bit for the on-package interconnect, and 13pJ/bit for the off-package interconnect [21]. The memory power overhead caused by cross-package migration depends on the migration interval. The minimum power overhead we observe is about 2X, which occurs when the migration interval is once per 1K memory accesses and the migration granularity is 4KB.

Fig. 16. The relative memory power when using heterogeneous on- and off-package DRAM with dynamic migration, compared to using only off-package DRAM.

V. RELATED WORK

A large body of prior work has examined how to use on-chip memory controllers and data migration schemes to improve system performance.

A. On-Chip Memory Controller

The recent trend of including on-chip memory controllers in modern microprocessor designs helps eliminate the system-bus bottleneck and enables a high-performance interface between a CPU and its DRAM DIMMs [22].

In addition, integrating the memory controller on chip also makes it possible to add an extra level of address indirection to remap data structures in memory and improve cache and bus utilization [23]. Our work, however, proposes a system architecture in which the memory controller is tightly integrated with the TLBs, a combination that is seldom available in contemporary commercial systems. Many DRAM scheduling policies have been proposed to improve memory controller throughput [24], [25]. Stream prefetchers are also commonly used in many processors [26], [27], since they do not require significant hardware cost and work well for a large number of applications. In addition, a prefetch-aware DRAM controller has been proposed [28] to maximize the benefit and minimize the harm of prefetching. Our work is orthogonal to these mechanisms: while aggressive prefetching is effective in tolerating memory latency by moving more useful data into caches, our work focuses on leveraging the fast regions in the heterogeneous memory space to hide the impact of low-performance regions.

B. Non-Uniform Access

The NUMA (non-uniform memory access) architecture is used in many commercial SMP (symmetric multiprocessing) clusters, such as those based on AMD Opteron and Alpha EV7 processors. In the ccNUMA architecture [29], each processor or processor cluster maintains a cache system to reduce inter-node traffic and the average latency to non-local data. The Fully-Buffered DIMM [30] is another example of NUMA, which can be configured to behave as a uniform architecture by forcing all DIMMs to stall and emulate the slowest one. For large caches where interconnect latency dominates the total access latency, a partitioned Non-Uniform Cache Access (NUCA) architecture has been adopted [31]. To improve the proximity of data and computation, dynamic-NUCA policies have been actively studied in recent years [12]-[14], [32]. Although the dynamic-NUCA method helps performance, the complexity of the search mechanism together with the high cost of communication makes it hard to scale. Our proposed heterogeneous main memory can also be viewed as a NUMA system. Although the on-package and off-package memory regions have different access speeds, both are local to the accessing processor, hence cache coherence is not an issue in this scenario. The previous research closest to our work is that of Ekman and Stenstrom [33], in which a multi-level exclusive main memory subsystem is proposed. In their work, pages can be gradually paged out to disk through multiple levels of memory regions, but proactive data migration from lower-tier to higher-tier memory regions is not allowed.
Different from most of the previous work, our approach introduces a new layer of address translation that is handled by the on-chip memory controller.

VI. CONCLUSION

SiP and 3D integration are promising technologies for bringing more memory onto the microprocessor package to mitigate the memory wall problem. In this paper, instead of using these on-package memory resources as caches, we studied an architecture that uses them as a portion of the main memory, working in partnership with a conventional main memory implemented with DIMMs. To manage this heterogeneous main memory, with data in both on-package and off-package regions, we introduced another layer of address translation: the conventional physical address keeps the OS paging system intact, while the newly added machine address represents the actual data location on the DRAM chips. By manipulating the physical-to-machine translation, our proposed on-chip memory controller design can dynamically migrate data across the package boundary. Compared to using the on-package memory as a cache, the heterogeneous main memory approach does not pay an extra penalty for tag accesses before data accesses or for cache misses. The evaluation results demonstrate that the heterogeneous main memory uses the on-package memory efficiently and achieves an average effectiveness of 83%.

REFERENCES

[1] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang et al., Die Stacking (3D) Microarchitecture, in MICRO '06, 2006.
[2] G. H. Loh, Y. Xie, and B. Black, Processor Design in 3D Die-Stacking Technologies, IEEE Micro, vol. 27, no. 3, 2007.
[3] International Technology Roadmap for Semiconductors, ITRS 2009 Edition.
[4] T. Kgil, A. Saidi, N. Binkert, S. Reinhardt, K. Flautner et al., PicoServer: Using 3D Stacking Technology to Build Energy Efficient Servers, ACM Journal on Emerging Technologies in Computing Systems, vol. 4, no. 4, pp. 1-34, 2008.
[5] G. H. Loh, 3D-Stacked Memory Architectures for Multi-core Processors, in ISCA '08, 2008.
[6] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood et al., A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy, in DAC '06, 2006.
[7] M. Ghosh and H.-H. S. Lee, Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs, in MICRO '07, 2007.
[8] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg et al., Simics: A Full System Simulation Platform, Computer, vol. 35, no. 2, pp. 50-58, 2002.
[9] S. Thoziyoor, J. H. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi, A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies, in ISCA '08, 2008.
[10] Micron, 2Gb: x4, x8, x16 DDR3 SDRAM, datasheet.
[11] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, Memory Access Scheduling, in ISCA '00, 2000.
[12] B. M. Beckmann, M. R. Marty, and D. A. Wood, ASR: Adaptive Selective Replication for CMP Caches, in MICRO '06, 2006.
[13] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, Optimizing Replication, Communication, and Capacity Allocation in CMPs, in ISCA '05, 2005.
[14] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger et al., A NUCA Substrate for Flexible CMP Cache Sharing, in ICS '05, 2005.
[15] N. Rafique, W.-T. Lim, and M. Thottethodi, Architectural Support for Operating System-Driven CMP Cache Management, in PACT '06, 2006.
[16] S. Cho and L. Jin, Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, in MICRO '06, 2006.
[17] UltraSPARC T2 Supplement to the UltraSPARC Architecture 2007, Sun Microsystems, Inc., Tech. Rep., 2007.
[18] G. Loh, Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy, in MICRO '09, 2009.
[19] J. Liedtke, Improving IPC by Kernel Design, in SOSP '93, 1993.
[20] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, COTSon: Infrastructure for Full System Simulation, ACM SIGOPS Operating Systems Review, vol. 43, no. 1, 2009.
[21] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, T. Takemoto, F. Yuki, and T. Saito, A 12.3mW 12.5Gb/s Complete Transceiver in 65nm CMOS, in ISSCC '10, 2010.
[22] W.-F. Lin, S. K. Reinhardt, and D. Burger, Reducing DRAM Latencies with an Integrated Memory Hierarchy Design, in HPCA '01, 2001.
[23] L. Zhang, Z. Fang, M. Parker, B. K. Mathew, L. Schaelicke et al., The Impulse Memory Controller, IEEE Transactions on Computers, vol. 50, no. 11, 2001.
[24] Z. Zhu and Z. Zhang, A Performance Comparison of DRAM Memory System Optimizations for SMT Processors, in HPCA '05, 2005.
[25] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, Self-Optimizing Memory Controllers: A Reinforcement Learning Approach, in ISCA '08, 2008.
[26] H. Q. Le, W. J. Starke, J. S. Fields, F. O'Connell et al., IBM POWER6 Microarchitecture, IBM Journal of Research and Development, vol. 51, no. 6, 2007.
[27] S. Sharma, J. G. Beu, and T. M. Conte, Spectral Prefetcher: An Effective Mechanism for L2 Cache Prefetching, ACM Transactions on Architecture and Code Optimization, vol. 2, no. 4, 2005.
[28] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, Prefetch-Aware DRAM Controllers, in MICRO '08, 2008.
[29] J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, in ISCA '97, 1997.
[30] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob, Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, in HPCA '07, 2007.
[31] C. Kim, D. Burger, and S. W. Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, in ASPLOS '02, 2002.
[32] J. Chang and G. S. Sohi, Cooperative Caching for Chip Multiprocessors, in ISCA '06, 2006.
[33] M. Ekman and P. Stenstrom, A Cost-Effective Main Memory Organization for Future Servers, in IPDPS '05, 2005.


More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Low Power AMD Athlon 64 and AMD Opteron Processors

Low Power AMD Athlon 64 and AMD Opteron Processors Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

Enabling Technologies for Distributed Computing

Enabling Technologies for Distributed Computing Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Proposed Multi Level Cache Models Based on Inclusion & Their Evaluation

Proposed Multi Level Cache Models Based on Inclusion & Their Evaluation International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue7- July 213 Proposed Multi Level Cache Based on Inclusion & Their Evaluation Megha Soni #1, Rajendra Nath *2 #1 M.Tech.,

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Memory Architecture and Management in a NoC Platform

Memory Architecture and Management in a NoC Platform Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011 Overview Motivation State of the Art Data Management Engine

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

3D Cache Hierarchy Optimization

3D Cache Hierarchy Optimization 3D Cache Hierarchy Optimization Leonid Yavits, Amir Morad, Ran Ginosar Department of Electrical Engineering Technion Israel Institute of Technology Haifa, Israel yavits@tx.technion.ac.il, amirm@tx.technion.ac.il,

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Performance Impacts of Non-blocking Caches in Out-of-order Processors Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order

More information

Thread level parallelism

Thread level parallelism Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

More information

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages

Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Implementation of Buffer Cache Simulator for Hybrid Main Memory and Flash Memory Storages Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 550):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 550): Review From 550 The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 550): Motivation for The Memory Hierarchy: CPU/Memory Performance Gap The Principle Of Locality Cache Basics:

More information

Price/performance Modern Memory Hierarchy

Price/performance Modern Memory Hierarchy Lecture 21: Storage Administration Take QUIZ 15 over P&H 6.1-4, 6.8-9 before 11:59pm today Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Last Time Exam discussion

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Energy Constrained Resource Scheduling for Cloud Environment

Energy Constrained Resource Scheduling for Cloud Environment Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Q & A From Hitachi Data Systems WebTech Presentation:

Q & A From Hitachi Data Systems WebTech Presentation: Q & A From Hitachi Data Systems WebTech Presentation: RAID Concepts 1. Is the chunk size the same for all Hitachi Data Systems storage systems, i.e., Adaptable Modular Systems, Network Storage Controller,

More information

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one

More information

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

theguard! ApplicationManager System Windows Data Collector

theguard! ApplicationManager System Windows Data Collector theguard! ApplicationManager System Windows Data Collector Status: 10/9/2008 Introduction... 3 The Performance Features of the ApplicationManager Data Collector for Microsoft Windows Server... 3 Overview

More information

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS TECHNOLOGY BRIEF August 1999 Compaq Computer Corporation Prepared by ISSD Technology Communications CONTENTS Executive Summary 1 Introduction 3 Subsystem Technology 3 Processor 3 SCSI Chip4 PCI Bridge

More information

Multi-core Systems What can we buy today?

Multi-core Systems What can we buy today? Multi-core Systems What can we buy today? Ian Watson & Mikel Lujan Advanced Processor Technologies Group COMP60012 Future Multi-core Computing 1 A Bit of History AMD Opteron introduced in 2003 Hypertransport

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

Maximizing Your Server Memory and Storage Investments with Windows Server 2012 R2

Maximizing Your Server Memory and Storage Investments with Windows Server 2012 R2 Executive Summary Maximizing Your Server Memory and Storage Investments with Windows Server 2012 R2 October 21, 2014 What s inside Windows Server 2012 fully leverages today s computing, network, and storage

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Measuring Cache Performance

Measuring Cache Performance Measuring Cache Performance Components of CPU time Program execution cycles Includes cache hit time Memory stall cycles Mainly from cache misses With simplifying assumptions: Memory stall cycles = = Memory

More information

Memory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar

Memory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar Memory ICS 233 Computer Architecture and Assembly Language Prof. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline Random

More information

Multicore Architectures

Multicore Architectures Multicore Architectures Week 1, Lecture 2 Multicore Landscape Intel Dual and quad-core Pentium family. 80-core demonstration last year. AMD Dual, triple (?!), and quad-core Opteron family. IBM Dual and

More information

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

Virtualization Performance on SGI UV 2000 using Red Hat Enterprise Linux 6.3 KVM

Virtualization Performance on SGI UV 2000 using Red Hat Enterprise Linux 6.3 KVM White Paper Virtualization Performance on SGI UV 2000 using Red Hat Enterprise Linux 6.3 KVM September, 2013 Author Sanhita Sarkar, Director of Engineering, SGI Abstract This paper describes how to implement

More information

Resource Efficient Computing for Warehouse-scale Datacenters

Resource Efficient Computing for Warehouse-scale Datacenters Resource Efficient Computing for Warehouse-scale Datacenters Christos Kozyrakis Stanford University http://csl.stanford.edu/~christos DATE Conference March 21 st 2013 Computing is the Innovation Catalyst

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

White Paper. Recording Server Virtualization

White Paper. Recording Server Virtualization White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...

More information

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique Jyoti Malhotra 1,Priya Ghyare 2 Associate Professor, Dept. of Information Technology, MIT College of

More information

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS AN INSTRUCTION WINDOW THAT CAN TOLERATE LATENCIES TO DRAM MEMORY IS PROHIBITIVELY COMPLEX AND POWER HUNGRY. TO AVOID HAVING TO

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Configuring Memory on the HP Business Desktop dx5150

Configuring Memory on the HP Business Desktop dx5150 Configuring Memory on the HP Business Desktop dx5150 Abstract... 2 Glossary of Terms... 2 Introduction... 2 Main Memory Configuration... 3 Single-channel vs. Dual-channel... 3 Memory Type and Speed...

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:

More information

Azul Compute Appliances

Azul Compute Appliances W H I T E P A P E R Azul Compute Appliances Ultra-high Capacity Building Blocks for Scalable Compute Pools WP_ACA0209 2009 Azul Systems, Inc. W H I T E P A P E R : A z u l C o m p u t e A p p l i a n c

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

CSE 30321 Computer Architecture I Fall 2009 Final Exam December 18, 2009

CSE 30321 Computer Architecture I Fall 2009 Final Exam December 18, 2009 CSE 30321 Computer Architecture I Fall 2009 Final Exam December 18, 2009 Test Guidelines: 1. Place your name on EACH page of the test in the space provided. 2. every question in the space provided. If

More information

TPCC-UVa: An Open-Source TPC-C Implementation for Parallel and Distributed Systems

TPCC-UVa: An Open-Source TPC-C Implementation for Parallel and Distributed Systems TPCC-UVa: An Open-Source TPC-C Implementation for Parallel and Distributed Systems Diego R. Llanos and Belén Palop Universidad de Valladolid Departamento de Informática Valladolid, Spain {diego,b.palop}@infor.uva.es

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

Exploring RAID Configurations

Exploring RAID Configurations Exploring RAID Configurations J. Ryan Fishel Florida State University August 6, 2008 Abstract To address the limits of today s slow mechanical disks, we explored a number of data layouts to improve RAID

More information

The assignment of chunk size according to the target data characteristics in deduplication backup system

The assignment of chunk size according to the target data characteristics in deduplication backup system The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,

More information

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 11 Memory Management Computer Architecture Part 11 page 1 of 44 Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin

More information

An Oracle White Paper May 2011. Exadata Smart Flash Cache and the Oracle Exadata Database Machine

An Oracle White Paper May 2011. Exadata Smart Flash Cache and the Oracle Exadata Database Machine An Oracle White Paper May 2011 Exadata Smart Flash Cache and the Oracle Exadata Database Machine Exadata Smart Flash Cache... 2 Oracle Database 11g: The First Flash Optimized Database... 2 Exadata Smart

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview

More information

Power-Aware High-Performance Scientific Computing

Power-Aware High-Performance Scientific Computing Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan

More information

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking

Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio

More information

Multi-core and Linux* Kernel

Multi-core and Linux* Kernel Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores

More information

HP Smart Array Controllers and basic RAID performance factors

HP Smart Array Controllers and basic RAID performance factors Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

More information

benchmarking Amazon EC2 for high-performance scientific computing

benchmarking Amazon EC2 for high-performance scientific computing Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

Integrating NAND Flash Devices onto Servers By David Roberts, Taeho Kgil, and Trevor Mudge

Integrating NAND Flash Devices onto Servers By David Roberts, Taeho Kgil, and Trevor Mudge Integrating NAND Flash Devices onto Servers By David Roberts, Taeho Kgil, and Trevor Mudge doi:.45/498765.49879 Abstract Flash is a widely used storage device in portable mobile devices such as smart phones,

More information

Adaptive Bandwidth Management for Performance- Temperature Trade-offs in Heterogeneous HMC+DDRx Memory

Adaptive Bandwidth Management for Performance- Temperature Trade-offs in Heterogeneous HMC+DDRx Memory Adaptive Bandwidth Management for Performance- Temperature Trade-offs in Heterogeneous HMC+DDRx Memory Mohammad Hossein Hajkazemi, Michael Chorney, Reyhaneh Jabbarvand Behrouz, Mohammad Khavari Tavana,

More information

Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55% openbench Labs Executive Briefing: April 19, 2013 Condusiv s Server Boosts Performance of SQL Server 2012 by 55% Optimizing I/O for Increased Throughput and Reduced Latency on Physical Servers 01 Executive

More information