Performance Impacts of Non-blocking Caches in Out-of-order Processors


Performance Impacts of Non-blocking Caches in Out-of-order Processors

Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi
HP Laboratories, HPL

Keyword(s): Non-blocking cache; MSHR; Out-of-order Processors

Abstract: Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed they could achieve significant performance gains in comparison to blocking caches. However, those experiments were performed with benchmarks that are now over a decade old. Furthermore, the processor that was simulated was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through and write-no-allocate caches. These assumptions are very different from today's high performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the impacts of non-blocking data caches using the latest SPECCPU2006 benchmark suite on practical high performance out-of-order (OOO) processors. Simulations show that a data cache that supports hit-under-2-misses can provide a 17.76% performance gain for a typical high performance OOO processor running the SPECCPU2006 benchmarks in comparison to a similar machine with a blocking cache.

External Posting Date: July 06, 2011 [Fulltext]
Internal Posting Date: July 06, 2011 [Fulltext]
Approved for External Publication
Copyright 2011 Hewlett-Packard Development Company, L.P.

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Sheng Li, Ke Chen, Jay B. Brockman, Norman P. Jouppi
Hewlett-Packard Labs, University of Notre Dame

1. Introduction and Motivations

Non-blocking caches can eliminate miss-induced processor stalls by buffering the misses and continuing to serve access requests. In order to leverage the benefits of non-blocking caches, there must be a pool of cache/memory operations that can be serviced out-of-order and effectively used by the processors.
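The buffering idea can be illustrated with a toy cycle-count model (our simplification for illustration, not the M5 setup used later in this report): each miss occupies a miss-buffer entry for the miss latency, and the cache locks up only when a miss arrives and no entry is free. The access trace and entry counts below are made up.

```python
# Toy model of cache lockup under a bounded number of outstanding misses.
# One access is attempted per cycle; the trace and latency are illustrative.

MISS_LATENCY = 90  # cycles, roughly the memory latency used in this report

def lockup_cycles(trace, mshr_entries):
    """Count cycles the cache is blocked for a trace of 'hit'/'miss' accesses."""
    cycle = blocked = 0
    outstanding = []  # completion cycles of in-flight misses
    for kind in trace:
        outstanding = [t for t in outstanding if t > cycle]  # retire finished misses
        if kind == 'miss':
            if mshr_entries == 0:
                blocked += MISS_LATENCY          # blocking cache: stall the full miss
                cycle += MISS_LATENCY
            elif len(outstanding) < mshr_entries:
                outstanding.append(cycle + MISS_LATENCY)  # miss proceeds in background
            else:
                wait = min(outstanding) - cycle  # lockup until the oldest miss returns
                blocked += wait
                cycle += wait
                outstanding = [t for t in outstanding if t > cycle]
                outstanding.append(cycle + MISS_LATENCY)
        cycle += 1
    return blocked

trace = ['miss', 'hit', 'hit', 'miss'] * 20
stalls = [lockup_cycles(trace, k) for k in (0, 1, 2, 64)]
```

Even this crude model reproduces the qualitative trend studied below: blocked cycles fall sharply from the lockup cache to one or two entries, and a large entry count eliminates lockup entirely for this trace.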
The processors with this ability include OOO processors such as Intel Nehalem [8], multithreaded processors such as Sun Niagara [9], and processors with run-ahead capability such as Sun Rock [10]. Previous research [1] demonstrated that data caches with non-blocking loads could achieve significant performance gains in comparison to blocking caches. The architecture assumed in that work [1] was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, fixed 16-cycle memory latency, and single-cycle latency for floating-point operations. Thus, the only stalls that could occur were those attributable to true data dependencies related to memory loads or cache lockup. This perfect architecture model can effectively isolate the impact of the data cache with non-blocking loads from other parts of the processor. However, these assumptions are very different from today's high performance out-of-order processors such as the Intel Nehalem. In addition, the previous work is based on write-through and write-no-allocate caches, while current caches are mostly write-back. Moreover, those experiments were done with SPECCPU92 benchmarks that are now over a decade old. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the performance impacts of non-blocking data caches using the latest SPECCPU2006 benchmark suite on high performance out-of-order (OOO) Intel Nehalem-like processors.

2. Methodology and Experiment Setup

The modeled Nehalem-like architecture has 4-issue OOO cores. Each core has a 32KB 4-way set-associative L1 instruction cache, a 32KB 8-way set-associative L1 data cache, a 256KB 8-way set-associative L2 cache, and a shared multi-banked 16-way set-associative 2MB-per-core L3. Following the Nehalem architecture, all caches are write-back and write-allocate. The load and store buffers have 48 and 32 entries, respectively, and support load forwarding within the same core. The re-order buffer contains 128 entries, enabling the OOO core to maintain 128 in-flight instructions. The core also has a 36-entry instruction window, which enables it to pick 4 instructions from the 36 candidates every cycle. We use M5 [3] to evaluate this architecture. We set the access time for each level of cache based on the timing specification of the Nehalem processor. The L1 Icache, L2, and L3 caches are assumed to be fully pipelined and non-blocking. We run CACTI [6] to estimate the memory latency to be around 90 cycles. The non-blocking Dcache is assumed to be implemented using inverted MSHRs [1] so that an unconstrained non-blocking cache can be achieved.
Since the L1 Dcache is write-back and write-allocate, we model the MSHR architecture as in [2] and [11], which extends the MSHR in [1] to support both read and write misses. This extension is necessary since the write-back cache must buffer the data that will be written to the cache line until the miss is serviced and space has been allocated for the received line. This advanced MSHR implementation can handle multiple request targets per MSHR entry to eliminate the stalls due to secondary misses; secondary misses are thus handled smoothly by the same MSHR entry as the corresponding primary miss. The L1 Dcache is also modeled with a write-back buffer that helps the MSHR achieve non-blocking stores. Our experiments show that a 16-deep write-back buffer together with the enhanced MSHRs is sufficient to achieve non-blocking stores for the L1 Dcache. A fully non-blocking cache with inverted MSHRs would require more than 128 entries so that each renamed register in the ROB can have an entry as the target for non-blocking loads. However, a 128-entry MSHR would be very expensive to implement. Our simulations show that a 64-entry MSHR is sufficient to eliminate all cache lockup cycles when running SPECCPU2006. Thus, in order to evaluate caches with different levels of non-blocking capability, we set the number of entries in the MSHR to 0, 1, 2, and 64. The cache with no MSHR (0 entries) is a lockup cache. The cache with a 64-entry MSHR can support up to 64 in-flight misses while still servicing requests (hit-under-64-misses) and is effectively an unconstrained non-blocking cache. The detailed architecture used in this study is shown in Table 1.
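The target-merging behavior described above can be sketched as follows. This is a minimal illustration with our own names, not the actual organization in [1], [2], or [11]: a secondary miss to a line that is already in flight simply records another target, and a structural lockup occurs only when every entry is busy.

```python
# Minimal sketch of an MSHR file with secondary-miss target merging.
class MSHRFile:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # block address -> list of pending targets

    def handle_miss(self, block_addr, target):
        """Classify a miss as 'primary', 'secondary', or 'lockup'."""
        if block_addr in self.entries:
            # Secondary miss: the line is already being fetched, so just
            # record another target instead of stalling the cache.
            self.entries[block_addr].append(target)
            return 'secondary'
        if len(self.entries) < self.num_entries:
            self.entries[block_addr] = [target]  # primary miss: allocate an entry
            return 'primary'
        return 'lockup'  # no free entry: the cache must block

    def fill(self, block_addr):
        """The line arrived from the next level: free the entry and
        return every target waiting on it."""
        return self.entries.pop(block_addr, [])

mshr = MSHRFile(num_entries=2)
r1 = mshr.handle_miss(0x40, 'rob_reg_1')
r2 = mshr.handle_miss(0x40, 'rob_reg_2')  # same line: merged, no new entry used
r3 = mshr.handle_miss(0x80, 'rob_reg_3')
r4 = mshr.handle_miss(0xC0, 'rob_reg_4')  # both entries busy: structural lockup
targets = mshr.fill(0x40)                 # wakes both merged targets
```

The usage above shows why two entries go a long way: clustered misses to the same line share one entry, so only misses to distinct lines consume capacity.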

Table 1. Architecture configuration for the Nehalem-like architecture used in this study.

Clock: 2.5GHz
L1 Icache: 32KB, 4-way, 64B line size, 4-cycle access latency
L1 Dcache: 32KB, 8-way, 64B line size, 4-cycle access latency; write-back, write-allocate; MSHR with 0 (lockup cache), 1, 2, and 64 (unconstrained non-blocking cache) entries; write-back buffer with 16 entries
L2 cache: 256KB, 8-way, 64B line size, 10-cycle access latency
L3 cache: 2MB per core, 64B line size, 36-cycle access latency
Memory: DDR3-1600, 90-cycle access latency
Issue width: 4
Instruction window size: 36
ROB size: 128
Load buffer size: 48
Store buffer size: 32

The SPEC CPU2006 [4] benchmark suite was used for all our experiments. This benchmark suite consists of integer (CINT) and floating-point (CFP) benchmarks, all of which are single-threaded. We selected 9 of the 12 integer benchmarks and 14 of the 17 floating-point benchmarks, as listed in Table 2. Because of the long simulation time (more than a month for benchmarks such as soplex), according to our own experiments and as in [7], we used SimPoint to reduce simulation time while maintaining accuracy. We found the representative simulation phases of each application and their weights using SimPoint 3.0 [5]. Next we simulated all simulation points and computed the final results using the simulation outputs and the weights of the simulation points. All SPEC CPU2006 benchmarks were compiled with O3 optimization using 4.2 (no further optimizations are allowed according to the SPECCPU2006 specification).

Table 2. SPECCPU2006 Benchmarks Used in the Experiments.

SPECCINT: bzip2, ...
SPECCFP: gamess, ..., soplex, ...
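The SimPoint weighting step can be sketched as follows; the phase CPIs and weights below are made-up numbers for illustration, not SimPoint output for any actual benchmark.

```python
# Combine per-simulation-point results into a whole-program estimate:
# each representative phase contributes its CPI weighted by the fraction
# of execution it represents, and the weights must sum to 1.
def weighted_cpi(phase_cpis, weights):
    assert abs(sum(weights) - 1.0) < 1e-9, "SimPoint weights must sum to 1"
    return sum(cpi * w for cpi, w in zip(phase_cpis, weights))

# e.g. three representative phases for one benchmark (illustrative values):
overall = weighted_cpi([0.9, 1.4, 2.1], [0.5, 0.3, 0.2])
```

This is why simulating only the representative points preserves accuracy: the weighted sum approximates the CPI of the full run at a small fraction of the simulation cost.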

3. Results

Figure 1. The ratio of the average Dcache/memory block cycles for a cache from lockup to fully non-blocking as the number of outstanding misses varies for the SPECCPU2006 benchmarks. The average cache block cycles are measured as cache block cycles per memory (Dcache) access. All numbers are normalized against that of the lockup cache setup (hit-under-0-miss).

A non-blocking cache can reduce the lockup time of the cache/memory subsystem, which in turn reduces the processor stall cycles caused by the cache/memory being unable to service accesses after lockup. Figure 1 shows the ratio of the average Dcache/memory block cycles for a cache from lockup to fully non-blocking. The average cache block cycles are dictated by both the non-blocking level of the cache and the behavior of the benchmarks (i.e., the cache miss ratio and the clustering pattern of the memory instructions). All numbers in Figure 1 are normalized against the lockup cache setup (hit-under-0-miss), so that the effectiveness of the different levels of non-blocking is clearly demonstrated. There are 9 integer and 14 floating-point benchmarks. For the integer programs, the average ratio of Dcache/memory block cycles is 15.72% for hit-under-1-miss and 5.11% for hit-under-2-misses. For the floating-point programs, the two averages are 23.89% and 7.00%, respectively. Hit-under-64-misses eliminates all block cycles for both CINT and CFP benchmarks; all remaining machine stall cycles are due to other causes, for example a data dependency on an outstanding miss. On average, hit-under-2-misses reduces the memory block cycles by 94.89% and 93% for CINT and CFP, respectively.
This demonstrates that a two-entry MSHR is enough for the SPECCPU2006 workloads to achieve non-blocking cache behavior. The average cache/memory block cycles affect both the Dcache access latency and the miss latency, since once the cache is locked up, neither hits nor misses can be served (although it is the misses that cause the lockup). Figures 2 and 3 show the impact of non-blocking caches on the Dcache access latency and the miss latency, respectively. The impact of non-blocking caches is larger on the access latency than on the miss latency of the Dcache, since the former is smaller than the latter. Figure 4 shows the miss rates of all caches. Combining Figures 2, 3, and 4, it can be seen that non-blocking caches provide more benefit for benchmarks with higher miss rates.
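The reduction percentages quoted for Figure 1 follow directly from the normalized ratios: the fraction eliminated is one minus the remaining block-cycle ratio.

```python
# Remaining normalized block-cycle ratios under hit-under-2-misses (Figure 1):
remaining = {'CINT': 0.0511, 'CFP': 0.0700}
# Fraction of block cycles eliminated relative to the lockup cache:
reduction = {k: 1.0 - v for k, v in remaining.items()}
```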

Figure 2. The ratio of the Dcache access latency for a cache from lockup to fully non-blocking as the number of outstanding misses varies for the SPECCPU2006 benchmarks. All numbers are normalized against that of the lockup cache setup (hit-under-0-miss).

Figure 3. The ratio of the Dcache miss latency for a cache from lockup to fully non-blocking as the number of outstanding misses varies for the SPECCPU2006 benchmarks. All numbers are normalized against that of the lockup cache setup (hit-under-0-miss).

Figure 4. Miss rates on all caches for the SPECCPU2006 benchmarks.

Average memory stall cycles cannot tell the whole story of the performance impact of non-blocking caches, because we are evaluating a practical Nehalem-like architecture that can still stall for various other reasons besides cache lockup, including a full ROB, a full instruction window, busy functional units, branch mispredictions, etc. For example, if there are 10% memory lockup stall cycles in the blocking case and the non-blocking cache eliminates all of them, then the average memory stall cycles are reduced by 100%. However, this reduction in average memory lockup stall cycles translates into a different reduction ratio in the cache/memory access latency and miss latency, which eventually affect the overall processor performance. Thus, it is important to evaluate the performance impact of the non-blocking cache on the overall CPI of the processor. Figure 5 shows the impact of the non-blocking Dcache on the overall CPI. Since all other architecture parameters are kept unchanged, this CPI ratio is directly caused by the use of the non-blocking Dcache. All numbers in Figure 5 are normalized against the lockup cache setup (hit-under-0-miss), so that the effectiveness of caches with different levels of non-blocking can be clearly evaluated. On average, the cache with hit-under-2-misses reduces the CPI by 16.2% for CFP and 8.3% for CINT, while the fully non-blocking Dcache reduces the CPI by 17.6% for CFP and 9.02% for CINT. This shows that a non-blocking Dcache that supports two in-flight misses achieves performance benefits comparable to a fully non-blocking cache, but at much lower implementation cost.
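A CPI reduction maps to a speedup via the usual identity; the sketch below uses this report's CFP hit-under-64 figure of 17.76% as an example input.

```python
def speedup_from_cpi_reduction(r):
    """With instruction count and clock frequency fixed, performance is
    proportional to 1/CPI, so reducing CPI by fraction r gives a speedup
    of old_CPI / new_CPI = 1 / (1 - r)."""
    return 1.0 / (1.0 - r)

cfp_speedup = speedup_from_cpi_reduction(0.1776)  # about a 1.22x speedup
```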

Figure 5. The ratio of the CPI for a cache from lockup to fully non-blocking as the number of outstanding misses varies for the SPECCPU2006 benchmarks. All numbers are normalized against that of the lockup cache setup (hit-under-0-miss).

For the integer programs, the average performance (measured as CPI) improvement over the lockup cache is 7.08% for hit-under-1-miss, 8.36% for hit-under-2-misses, and 9.02% for hit-under-64-misses (essentially the unconstrained non-blocking cache). For the floating-point programs, the three numbers are 12.69%, 16.22%, and 17.76%, respectively.

4. Conclusions

In this report, we studied the performance impact of a non-blocking data cache on practical high performance OOO processors with the latest representative applications, the SPECCPU2006 benchmark suite. Overall, the non-blocking cache can improve performance by 17.76% over a lockup cache. We found that a cache supporting 2 in-flight misses is sufficient to eliminate the majority of memory stall cycles, and the processor stall cycles they induce, for most of the SPECCPU2006 benchmarks. This is the design sweet spot, achieving a balanced trade-off between performance gain and implementation complexity. Finally, our study shows trends similar to, but of smaller magnitude than, the earlier study [1] that assumed a perfect single-issue processor. This is because stalls caused by other factors, such as hardware resource conflicts, branch mispredictions, and long floating-point operation latencies, attenuate the performance dependency on the non-blocking cache.

5. References

[1] K. I. Farkas and N. P. Jouppi, "Complexity/Performance Tradeoffs with Non-Blocking Loads," ISCA 1994.
[2] J. Tuck et al., "Scalable Cache Miss Handling for High Memory-Level Parallelism," MICRO 39.
[3] N. L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, vol. 26, no. 4, pp. 52-60, 2006.
[4] J. L. Henning, "Performance Counters and Development of SPEC CPU2006," Computer Architecture News, vol. 35, no. 1.
[5] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," ASPLOS, Oct. 2002.
[6] CACTI 6.5.
[7] K. Ganesan, D. Panwar, and L. John, "Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory, and TLB Characteristics," 2009 SPEC Benchmark Workshop, Austin.
[8] R. Kumar and G. Hinton, "A Family of 45nm IA Processors," ISSCC 2009.
[9] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2.
[10] M. Tremblay and S. Chaudhry, "A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC Processor," ISSCC 2008.
[11] M. Jahre and L. Natvig, "Performance Effects of a Cache Miss Handling Architecture in a Multicore Processor," NIK-2007 conference.


A Taxonomy to Enable Error Recovery and Correction in Software A Taxonomy to Enable Error Recovery and Correction in Software Vilas Sridharan ECE Department rtheastern University 360 Huntington Ave. Boston, MA 02115 [email protected] Dean A. Liberty Advanced Micro

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09 Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors NoCArc 09 Jesús Camacho Villanueva, José Flich, José Duato Universidad Politécnica de Valencia December 12,

More information

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

Modeling Virtual Machine Performance: Challenges and Approaches

Modeling Virtual Machine Performance: Challenges and Approaches Modeling Virtual Machine Performance: Challenges and Approaches Omesh Tickoo Ravi Iyer Ramesh Illikkal Don Newell Intel Corporation Intel Corporation Intel Corporation Intel Corporation [email protected]

More information

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab Performance monitoring at CERN openlab July 20 th 2012 Andrzej Nowak, CERN openlab Data flow Reconstruction Selection and reconstruction Online triggering and filtering in detectors Raw Data (100%) Event

More information

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 AMD PhenomII Architecture for Multimedia System -2010 Prof. Cristina Silvano Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 Outline Introduction Features Key architectures References AMD Phenom

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

NVIDIA Tegra 4 Family CPU Architecture

NVIDIA Tegra 4 Family CPU Architecture Whitepaper NVIDIA Tegra 4 Family CPU Architecture 4-PLUS-1 Quad core 1 Table of Contents... 1 Introduction... 3 NVIDIA Tegra 4 Family of Mobile Processors... 3 Benchmarking CPU Performance... 4 Tegra 4

More information

This Unit: Caches. CIS 501 Introduction to Computer Architecture. Motivation. Types of Memory

This Unit: Caches. CIS 501 Introduction to Computer Architecture. Motivation. Types of Memory This Unit: Caches CIS 5 Introduction to Computer Architecture Unit 3: Storage Hierarchy I: Caches Application OS Compiler Firmware CPU I/O Memory Digital Circuits Gates & Transistors Memory hierarchy concepts

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

An examination of the dual-core capability of the new HP xw4300 Workstation

An examination of the dual-core capability of the new HP xw4300 Workstation An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,

More information

Precise and Accurate Processor Simulation

Precise and Accurate Processor Simulation Precise and Accurate Processor Simulation Harold Cain, Kevin Lepak, Brandon Schwartz, and Mikko H. Lipasti University of Wisconsin Madison http://www.ece.wisc.edu/~pharm Performance Modeling Analytical

More information

Building an Inexpensive Parallel Computer

Building an Inexpensive Parallel Computer Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

More information

HP reference configuration for entry-level SAS Grid Manager solutions

HP reference configuration for entry-level SAS Grid Manager solutions HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2

More information

Chapter 2. Why is some hardware better than others for different programs?

Chapter 2. Why is some hardware better than others for different programs? Chapter 2 1 Performance Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than

More information

Computer Science 146/246 Homework #3

Computer Science 146/246 Homework #3 Computer Science 146/246 Homework #3 Due 11:59 P.M. Sunday, April 12th, 2015 We played with a Pin-based cache simulator for Homework 2. This homework will prepare you to setup and run a detailed microarchitecture-level

More information

CS/COE1541: Introduction to Computer Architecture. Memory hierarchy. Sangyeun Cho. Computer Science Department University of Pittsburgh

CS/COE1541: Introduction to Computer Architecture. Memory hierarchy. Sangyeun Cho. Computer Science Department University of Pittsburgh CS/COE1541: Introduction to Computer Architecture Memory hierarchy Sangyeun Cho Computer Science Department CPU clock rate Apple II MOS 6502 (1975) 1~2MHz Original IBM PC (1981) Intel 8080 4.77MHz Intel

More information

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

on an system with an infinite number of processors. Calculate the speedup of

on an system with an infinite number of processors. Calculate the speedup of 1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

TPCalc : a throughput calculator for computer architecture studies

TPCalc : a throughput calculator for computer architecture studies TPCalc : a throughput calculator for computer architecture studies Pierre Michaud Stijn Eyerman Wouter Rogiest IRISA/INRIA Ghent University Ghent University [email protected] [email protected]

More information

Tableau Server 7.0 scalability

Tableau Server 7.0 scalability Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different

More information

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.

Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier. Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core

More information

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors Leonid Domnitser, Nael Abu-Ghazaleh and Dmitry Ponomarev Department of Computer Science SUNY-Binghamton {lenny,

More information

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

More information

Multithreading Lin Gao cs9244 report, 2006

Multithreading Lin Gao cs9244 report, 2006 Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............

More information

How To Build A Cloud Computer

How To Build A Cloud Computer Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology

More information

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers WHITE PAPER FUJITSU PRIMERGY AND PRIMEPOWER SERVERS Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers CHALLENGE Replace a Fujitsu PRIMEPOWER 2500 partition with a lower cost solution that

More information

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information