Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. Yuan Chou Architecture Technology Group Microelectronics Division
Motivation
- Performance of many commercial applications limited by processor stalls due to off-chip cache misses
- Applications characterized by irregular control flow and complex data access patterns
- Software prefetching and simple stride-based hardware prefetching ineffective
- Hardware correlation prefetching more promising: can remember complex recurring data access patterns
- Current correlation prefetchers have severe drawbacks, but we think we can overcome them
Talk Outline
- Traditional Correlation Prefetching
- Epoch-Based Correlation Prefetching
- Experimental Results
- Summary
Traditional Correlation Prefetching
- Basic idea: use current miss address M to predict N future miss addresses F_1 ... F_N (where N = prefetch depth)
- Miss address sequence: A B C D E F G H I (assume N = 2)
  - Use A to prefetch B C
  - Use D to prefetch E F
  - Use G to prefetch H I
- Correlations recorded in correlation table
- Correlation table size proportional to application working set
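The table mechanics described on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the talk: the class name `CorrelationTable`, the `max_entries` limit and the eviction of the least recently updated entry are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class CorrelationTable:
    """Toy model of a traditional correlation prefetcher (hypothetical names)."""

    def __init__(self, depth=2, max_entries=1024):
        self.depth = depth              # N = prefetch depth
        self.max_entries = max_entries  # assumed capacity limit
        self.table = OrderedDict()      # miss address -> next N miss addresses
        self.history = []               # last depth+1 miss addresses seen

    def record_miss(self, addr):
        """Learn that the miss seen `depth` misses ago is followed by the recent misses."""
        self.history.append(addr)
        if len(self.history) > self.depth + 1:
            self.history.pop(0)
        if len(self.history) == self.depth + 1:
            trigger, successors = self.history[0], self.history[1:]
            self.table[trigger] = list(successors)
            self.table.move_to_end(trigger)
            if len(self.table) > self.max_entries:
                self.table.popitem(last=False)  # evict least recently updated entry

    def predict(self, addr):
        """Return the addresses to prefetch when `addr` misses (empty if unknown)."""
        return list(self.table.get(addr, []))


table = CorrelationTable(depth=2)
for miss in "ABCDEFGHI":
    table.record_miss(miss)
predictions = {m: table.predict(m) for m in "ADG"}
```

After training on the slide's miss sequence, `predictions` maps A to B C, D to E F and G to H I, as in the slide. The model also records an entry for every other miss address, which is one way to see why table size grows with the working set.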
Correlation Prefetching Drawbacks
- Very large correlation tables needed for commercial apps: impractical to store on-chip
- No attempt to eliminate all naturally overlapped misses
  - Since C, D and E are naturally overlapped, prefetching only C may not improve performance
- Prefetches misses naturally overlapped with current miss
  - Since A and B are naturally overlapped, prefetching B does not improve performance but wastes table storage
- Miss address sequence: A B C D E F G H I
[Figure: timeline of compute periods and off-chip accesses for misses A through I]
Epoch-Based Correlation Prefetching (EBCP)
Epoch MLP Model
- At high off-chip latencies, overlappable off-chip accesses appear to issue and complete together
- Program execution separates into recurring periods of on-chip computation followed by off-chip accesses
- Call each period an epoch
- Group off-chip accesses by the epoch in which they issue
[Figure: timeline of compute periods and off-chip accesses for misses A through I, divided into epochs i to i+3]

Epoch           i     i+1     i+2   i+3
Miss addresses  A B   C D E   F G   H I
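The grouping of misses into epochs can be sketched in a few lines. This is a simplified model, not the mechanism from the talk: the function name `group_into_epochs` and the issue and completion times below are invented for illustration, and a new epoch is assumed to start whenever a miss issues after all earlier misses have completed.

```python
def group_into_epochs(misses):
    """Group misses into epochs.

    `misses` is a list of (address, issue_time, complete_time) tuples sorted by
    issue time. A miss that issues while an earlier miss is still outstanding
    joins the current epoch; otherwise it starts a new one.
    """
    epochs = []
    busy_until = None  # time at which the current epoch's last miss completes
    for addr, issue, complete in misses:
        if busy_until is None or issue >= busy_until:
            epochs.append([addr])
            busy_until = complete
        else:
            epochs[-1].append(addr)
            busy_until = max(busy_until, complete)
    return epochs


# Hypothetical timings (in cycles) with a 500-cycle miss latency.
misses = [
    ("A", 0, 500), ("B", 20, 520),
    ("C", 700, 1200), ("D", 710, 1210), ("E", 730, 1230),
    ("F", 1500, 2000), ("G", 1510, 2010),
    ("H", 2300, 2800), ("I", 2320, 2820),
]
epochs = group_into_epochs(misses)
```

With these made-up timings the result reproduces the slide's grouping: A B, then C D E, then F G, then H I.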
Epoch Model Insights
Insight #1: Target removal of entire epochs instead of individual misses
- Miss address sequence: A B C D E F G H I
- Use first miss in epoch to prefetch all misses in next 2 epochs
- Results in removal of 2 epochs
[Figure: timeline of compute periods and off-chip accesses divided into epochs i to i+3]

Epoch           i     i+1     i+2   i+3
Miss addresses  A B   C D E   F G   H I
Removed               X       X
Epoch-Based Correlation Prefetcher

No prefetching
Epoch           i     i+1     i+2   i+3
Miss addresses  A B   C D E   F G   H I

Epoch-based correlation prefetching (EBCP)
Epoch           i     i+1
Miss addresses  A B   H I
Prefetches      C D E F G

Traditional correlation prefetching (depth=2)
Epoch           i     i+1   i+2
Miss addresses  A B   E     H I
Prefetches      B C D F G

EBCP achieves better epoch reduction
Epoch Model Insights
Insight #2: Hide latency of correlation table access under previous epoch
- Use miss in epoch i to prefetch all misses in epochs i+2 and i+3
- Use epoch i to read correlation table
- Use epoch i+1 to issue prefetches
- Results in removal of 2 epochs
- Correlation table can be stored in main memory!
[Figure: timeline of compute periods and off-chip accesses divided into epochs i to i+3, with the correlation table read during epoch i]

Epoch           i     i+1     i+2   i+3
Miss addresses  A B   C D E   F G   H I
Prefetches                    F G   H I
Removed                       X     X
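Both insights can be expressed as one table-building rule: key each entry on the first miss of an epoch and store the misses of later epochs. The sketch below is an assumption-laden illustration rather than the talk's implementation. The function `learn_ebcp_table` and its `skip` and `span` parameters are hypothetical: `skip=0` corresponds to Insight #1 (prefetch the next 2 epochs) and `skip=1` to Insight #2 (skip one epoch so the table read can be hidden).

```python
def learn_ebcp_table(epochs, skip=1, span=2):
    """Map the first miss of epoch i to the misses of `span` later epochs.

    The target epochs start at i + skip + 1, so `skip` is the number of
    epochs left between the trigger epoch and the first prefetched epoch.
    """
    table = {}
    for i, epoch in enumerate(epochs):
        start = i + skip + 1
        targets = [addr for later in epochs[start:start + span] for addr in later]
        if targets:
            table[epoch[0]] = targets
    return table


# Epoch grouping taken from the slides.
epochs = [["A", "B"], ["C", "D", "E"], ["F", "G"], ["H", "I"]]

hidden_latency_table = learn_ebcp_table(epochs, skip=1)  # Insight #2
next_epochs_table = learn_ebcp_table(epochs, skip=0)     # Insight #1
```

Under these assumptions, the Insight #2 table maps A to F G H I, and the Insight #1 table maps A to C D E F G, matching the prefetch rows shown on the slides.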
EBCP Advantages
- Trad: store correlation table on-chip
  EBCP: store correlation table in main memory (hide table access latency under previous epoch)
- Trad: no attempt to eliminate all naturally overlapped misses
  EBCP: target removal of entire epochs
- Trad: prefetch misses naturally overlapped with current miss
  EBCP: avoid prefetching these misses
EBCP overcomes drawbacks of traditional correlation prefetchers
EBCP Components
- Prefetcher control observes all L2 cache requests
- L2 banks notify prefetcher control which requests are misses
[Figure: block diagram with processor cores (each with L1-I and L1-D caches), prefetch control, crossbar, four L2 banks, and two memory controllers, each attached to DRAM holding a correlation table]
EBCP Prefetcher Control
- Request OS for memory to store correlation table
- Detect epochs
  - observe when number of off-chip misses transitions from 0 to 1
- Learn correlations
  - record correlations in main memory correlation table
- Issue prefetches
  - use first miss address in epoch to look up correlation table
  - select miss addresses from correlation table entry
  - issue prefetches (lower priority than demand accesses)
- Return memory to OS if needed
EBCP is very simple and requires almost zero on-chip storage!
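The epoch-detection step needs only a counter of outstanding off-chip misses. The following is a minimal sketch of that idea, assuming the prefetcher control is told when each miss issues and completes. The class `EpochDetector`, its method names and the event list are hypothetical and do not come from the talk.

```python
class EpochDetector:
    """Flags the start of an epoch when outstanding misses go from 0 to 1."""

    def __init__(self):
        self.outstanding = 0      # number of off-chip misses in flight
        self.epoch_triggers = []  # first miss address of each detected epoch

    def miss_issued(self, addr):
        """Record a new off-chip miss; return True if it starts a new epoch."""
        starts_epoch = self.outstanding == 0
        self.outstanding += 1
        if starts_epoch:
            self.epoch_triggers.append(addr)
        return starts_epoch

    def miss_completed(self):
        """Record that one outstanding miss has returned from memory."""
        if self.outstanding == 0:
            raise ValueError("no outstanding miss to complete")
        self.outstanding -= 1


detector = EpochDetector()
events = [
    ("issue", "A"), ("issue", "B"), ("done", None), ("done", None),
    ("issue", "C"), ("issue", "D"), ("issue", "E"),
    ("done", None), ("done", None), ("done", None),
    ("issue", "F"),
]
for kind, addr in events:
    if kind == "issue":
        detector.miss_issued(addr)
    else:
        detector.miss_completed()
```

In this made-up event stream the detector records A, C and F as epoch triggers. In the scheme described on the slide, these first-miss addresses are the ones used to look up the correlation table.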
Experimental Results
Baseline Processor Model
- Moderate out-of-order issue core
  - single thread
  - -wide issue
  - 6 entry issue queue, 12 entry reorder buffer
  - KB -way L1 instruction and data caches
  - 2MB -way L2 cache
  - prefetches installed into prefetch buffer
- Memory bandwidth model
  - 9.6 GB/s read bandwidth. GB/s write bandwidth
  - 500 cycle unloaded memory latency
- Commercial applications benchmarks
  - OLTP, TPC-W, SPECjbb2005, SPECjAppServer200
Effects of Prefetch Degree
[Figure: % performance improvement vs. prefetch degree for OLTP, TPC-W, SPECjbb and SPECjAppServer, with an infinite correlation table]
Performance improvement increases with prefetch degree
Coverage vs Accuracy
[Figure: two panels, % coverage vs. prefetch degree and % accuracy vs. prefetch degree, each for OLTP, TPC-W, SPECjAppServer and SPECjbb]
Memory Bandwidth Sensitivity
[Figure: % performance improvement vs. prefetch degree for OLTP, TPC-W, SPECjbb and SPECjAppServer at BW=3.2GB/s, BW=6.GB/s and BW=9.6GB/s]
Optimal prefetch degree depends on available memory BW
Correlation Table Size
[Figure: % performance improvement vs. predictor table entries at a fixed prefetch degree, one panel each for OLTP, TPC-W, SPECjbb and SPECjAppServer]
Storing table in main memory makes such large sizes practical
Comparison with Other Prefetchers
- Global History Buffer G/AC (GHB): address correlation, unique table storage (small: 256KB large: MB)
- Tag Correlating Prefetcher (TCP): tag correlation (small: 256KB large: MB)
- Stream: traditional stride-based stream prefetcher
- Spatial Memory Streaming (SMS): spatial locality within region (12KB)
- Solihin: memory-side address correlation prefetcher (6MB)
- Prefetch degree = 6 for all prefetchers (except SMS)
- Prefetches brought into 6 entry prefetch buffer
Comparison with Other Prefetchers
[Figure: % performance improvement for GHB small, GHB large, TCP small, TCP large, Stream, SMS, Solihin 3,2, Solihin 6,1, EBCP minus and EBCP, one panel each for OLTP, TPC-W, SPECjbb and SPECjAppServer]
EBCP outperforms all prefetchers for all four benchmarks
Summary
- EBCP successfully overcomes drawbacks of traditional correlation prefetchers
  - stores large correlation table in main memory
  - exploits unused memory capacity and bandwidth
  - targets removal of entire epochs
  - very simple prefetcher control
  - almost zero on-chip storage
- EBCP performs very well on all four commercial benchmarks
- Future work:
  - efficient implementation for chip multi-processors
  - improved accuracy
- Epoch-based concept can be applied to other uarch techniques!
Yuan Chou
Prefetch Buffer Size
[Figure: % performance improvement vs. prefetch buffer entries for OLTP, TPC-W, SPECjbb and SPECjAppServer, at a fixed prefetch degree with 1 million table entries]
6 entries sufficient for all four benchmarks
More informationPrefetch-Aware Shared-Resource Management for Multi-Core Systems
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt HPS Research Group The University of Texas at Austin {ebrahimi, patt}@hps.utexas.edu
More informationRAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29
RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant
More informationEE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution
More informationA Quantum Leap in Enterprise Computing
A Quantum Leap in Enterprise Computing Unprecedented Reliability and Scalability in a Multi-Processor Server Product Brief Intel Xeon Processor 7500 Series Whether you ve got data-demanding applications,
More information"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
More informationPerformance Brief: MegaRAID SAS 9265/9285 Series
MegaRAID SAS 9265/9285 Series Performance Brief Performance Brief: MegaRAID SAS 9265/9285 Series Executive Summary PERFORMANCE SUMMARY n Measured IOPS surpass 200,000 IOPS n When used with MegaRAID FastPath
More informationBindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27
Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationOperating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015
Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program
More informationE6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationUltraSPARC T1: A 32-threaded CMP for Servers. James Laudon Distinguished Engineer Sun Microsystems james.laudon@sun.com
UltraSPARC T1: A 32-threaded CMP for Servers James Laudon Distinguished Engineer Sun Microsystems james.laudon@sun.com Outline Page 2 Server design issues > Application demands > System requirements Building
More informationOperating System Scheduling for Efficient Online Self-Test in Robust Systems. Yanjing Li. Onur Mutlu. Subhasish Mitra
Operating System Scheduling for Efficient Online Self-Test in Robust Systems Yanjing Li Stanford University Onur Mutlu Carnegie Mellon University Subhasish Mitra Stanford University 1 Why Online Self-Test
More information