HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M. BECKMANN, MARK D. HILL*, STEVEN K. REINHARDT, DAVID A. WOOD*

*University of Wisconsin-Madison
Advanced Micro Devices, Inc.

Powerpoint version available on: http://pages.cs.wisc.edu/~powerjg/ 2 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46

ABSTRACT
- Hardware coherence can increase the utility of heterogeneous systems
- Major bottlenecks in current coherence implementations
  - High bandwidth difficult to support at directory
  - Extreme resource requirements
- We propose Heterogeneous System Coherence
  - Leverages spatial locality and region coherence
  - Reduces bandwidth by 94%
  - Reduces resource requirements by 95%

PHYSICAL INTEGRATION
[Figure: physical integration of GPU, CPU cores, and stacked high-bandwidth DRAM. Credit: IBM]

LOGICAL INTEGRATION
- General-purpose GPU computing
  - OpenCL
  - CUDA
- Heterogeneous Uniform Memory Access (hUMA)
  - Shared virtual address space
  - Cache coherence
- Allows new heterogeneous apps

OUTLINE
- Motivation
- Background
  - System overview
  - Cache architecture reminder
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
- Results
- Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL
[Figure: Accelerated Processing Unit (APU), high-bandwidth interconnect, and DRAM channels]

SYSTEM OVERVIEW: APU
- GPU compute accesses must stay coherent
- Direct-access bus (used for graphics)
- Invalidation traffic
[Figure: APU containing a GPU cluster, a CPU cluster, and a directory that connects to DRAM; arrow thickness indicates bandwidth]

SYSTEM OVERVIEW: GPU
- Very high bandwidth: L2 has high miss rate
[Figure: GPU cluster of compute units (CUs), each with its own L1, sharing the GPU L2 cache. CU detail: instruction fetch/decode, register file, execution units (Ex), local scratchpad memory, and a coalescer feeding the L1]

SYSTEM OVERVIEW: CPU
- Low bandwidth: low L2 miss rate
[Figure: CPU cluster of CPU cores, each with an L1, sharing an L2 that connects to the directory]

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE
- Demand requests from L1
  - Searches cache tags for a tag match
  - Allocates an MSHR entry
  - On a hit, return data to the L1
  - On a miss, send request to directory
- On a directory probe, check MSHRs and tags
  - Tag hit on probe: send data to other core
[Figure: L2 cache pipeline. Demand requests and data responses on the core side; cache tag arrays and MSHRs (MSHR entries) with hit and miss paths; miss requests, data responses, and probe requests on the coherent network interface]
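The request flow on this slide can be sketched as a toy model. This is a minimal illustration, not the simulator's code: the class name `L2Cache`, the MSHR capacity, the 64-B block size, and the "merged" and "stall" outcomes are assumptions added for the example, and the sketch allocates an MSHR only on a miss, which simplifies the slide's ordering.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64  # assumed block size for this sketch


@dataclass
class L2Cache:
    """Toy model of the L2 request flow (hypothetical, for illustration only)."""
    num_mshrs: int = 4                                # hypothetical MSHR capacity
    tags: set = field(default_factory=set)            # blocks currently cached
    mshrs: set = field(default_factory=set)           # blocks with an outstanding miss
    to_directory: list = field(default_factory=list)  # miss requests sent to the directory

    def demand_request(self, addr: int) -> str:
        """Handle a demand request arriving from an L1."""
        block = addr // BLOCK_BYTES
        if block in self.tags:
            return "hit"        # tag match: return data to the L1
        if block in self.mshrs:
            return "merged"     # assumption: join the miss already in flight
        if len(self.mshrs) >= self.num_mshrs:
            return "stall"      # no free MSHR: back-pressure on the requester
        self.mshrs.add(block)   # the MSHR is held until the data response arrives
        self.to_directory.append(block)
        return "miss"

    def data_response(self, block: int) -> None:
        """Data arrived for an outstanding miss: fill the cache, free the MSHR."""
        self.mshrs.discard(block)
        self.tags.add(block)

    def probe(self, block: int) -> bool:
        """Directory probe: check MSHRs and tags for the probed block."""
        return block in self.tags or block in self.mshrs
```

With `num_mshrs=1`, a second miss to a different block stalls until `data_response` frees the entry, which is the back-pressure effect the later bottleneck slides describe.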

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY
- Demand requests from L2
  - Searches cache tags for a tag match
  - Allocates an MSHR entry
  - Allocate and send probes to L2 caches
  - On a miss, the data comes from DRAM
[Figure: directory pipeline. Coherent block requests feed the block directory tag array and MSHRs (MSHR entries) with hit and miss paths; the probe request RAM (PR entries) handles block probe requests/responses; misses go to DRAM]

BACKGROUND SUMMARY
- System under investigation
  - Heterogeneous CPU-GPU on chip
  - High-bandwidth DRAM
- Directory pipeline is complex
  - MSHR array is associative
  - Difficult to pipeline with more than 1 request per cycle
- Important resources: MSHR entries

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
  - Simulation overview
  - Directory bandwidth
  - MSHRs
  - Performance is significantly affected
- Heterogeneous System Coherence Details
- Results
- Conclusions

SIMULATION DETAILS
- gem5 simulator
  - Simple CPU
  - GPU simulator based on AMD GCN
  - All memory requests through gem5
- Workloads
  - Modified to use hUMA
  - Rodinia & AMD APP SDK

Parameter            Value
CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
- bp: trains the connection weights on a neural network
- bfs: breadth-first search
- hs: performs a transient 2D thermal simulation (5-point stencil)
- lud: matrix decomposition
- nw: performs a global optimization for DNA sequence alignment
- km: does k-means clustering
- sd: speckle-reducing anisotropic diffusion
AMD SDK
- bn: bitonic sort
- dct: discrete cosine transform
- hg: histogram
- mm: matrix multiplication

SYSTEM BOTTLENECKS
- High bandwidth
  - Designed to support CPU bandwidth
  - Difficult to scale directory bandwidth
  - Difficult to multi-port
  - Complicated pipeline
- High resource usage
  - Must allocate MSHR for entire duration of request
  - MSHR array difficult to scale
[Figure: APU with GPU cluster and CPU cluster connected to the directory, which connects to DRAM]

DIRECTORY TRAFFIC
- Difficult to support >1 request per cycle
[Figure: directory accesses per GPU cycle (y-axis, 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

RESOURCE USAGE
- Very difficult to scale MSHR array
- Steady state at 700 GB/s
- Causes significant back-pressure on L2s
[Figure: maximum MSHRs (y-axis, log scale from 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
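One way to see why the MSHR count grows so large is Little's law: entries in flight equal the request rate times the time each entry is held. The sketch below applies it to the 700 GB/s peak bandwidth from the simulation parameters. The 64-B block size and the occupancy times are assumptions chosen for illustration; the slides report only the measured maxima, not a latency.

```python
def mshrs_in_flight(bandwidth_bytes_per_s: float, block_bytes: int, latency_s: float) -> float:
    """Little's law: outstanding requests = request rate x time each request is held."""
    requests_per_s = bandwidth_bytes_per_s / block_bytes
    return requests_per_s * latency_s


PEAK_BW = 700e9  # 700 GB/s peak DRAM bandwidth (from the simulation details)
BLOCK = 64       # assumed 64-B blocks

# Hypothetical MSHR occupancy times; none of these values come from the slides.
for latency_ns in (100, 500, 1000):
    n = mshrs_in_flight(PEAK_BW, BLOCK, latency_ns * 1e-9)
    print(f"{latency_ns:5d} ns -> about {n:,.0f} MSHRs")
```

Under these assumptions, an occupancy near one microsecond at peak bandwidth already implies on the order of ten thousand entries, the same order of magnitude as the maxima the plot reports.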

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
- Back-pressure from limited MSHRs and bandwidth
[Figure: slowdown (y-axis, 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

BOTTLENECKS SUMMARY
- Directory bandwidth
  - Must support up to 4 requests per cycle
  - Difficult to construct pipeline
- Resource usage
  - MSHRs are a constraining resource
  - Need more than 10,000
- Without resource constraints, up to 4x better performance

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
  - Overall system design
  - Region buffer design
  - Region directory design
  - Example
  - Hardware complexity
- Results
- Conclusions

BASELINE DIRECTORY COHERENCE
- Initialization
- Kernel launch
- Read result
[Figure: APU with GPU cluster and CPU cluster connected to the directory, which connects to DRAM]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
- Initialization
- Kernel launch
[Figure: APU with GPU cluster and CPU cluster connected to the directory, which connects to DRAM]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
- Region buffers coordinate with region directory
[Figure: APU with a region buffer at the GPU cluster and at the CPU cluster, a region directory in place of the directory, a direct-access bus, and the path to DRAM]

HSC: EXAMPLE MEMORY REQUEST
[Figure: example memory request on the APU. Components shown: GPU L2 cache, GPU region buffer, GPU cluster, CPU cluster, CPU region buffer, region directory, and DRAM]

HSC: L2 CACHE & REGION BUFFER
- Region tags and permissions
- Only region-level permission traffic
- Interface for direct-access bus
[Figure: L2 cache pipeline (cache tag arrays, MSHRs, hit and miss paths) extended with a region buffer. L2 misses that hit in the region buffer use the direct-access bus interface; region buffer misses and probe requests use the coherent network interface]

HSC: REGION DIRECTORY
- Region tags, sharers, and permissions
[Figure: directory pipeline with the block directory tag array replaced by a region directory tag array. Region permission requests replace coherent block requests; MSHRs (MSHR entries), hit and miss paths, the probe request RAM (PR entries), block probe requests/responses, and the path to DRAM remain]

HSC: HARDWARE COMPLEXITY
- Region protocols reduce directory size
  - Region directory: 8x fewer entries
- Region buffers
  - At each L2 cache
  - 1-KB region (16 64-B blocks)
  - 16-K region entries
  - Overprovisioned for low-locality workloads

(a) Region Directory Entry
Field        Width
Region Tag   18 bits
State        2 bits
CPU, GPU     1 valid bit per cluster

(b) Region Buffer Entry
Field        Width
Region Tag   18 bits
State        2 bits
B0 ... B15   1 valid bit per block in the region
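The region geometry and entry widths above lend themselves to a short worked example. The sketch below splits a byte address into a region number and a block index, then totals the bits per entry from the listed field widths. The function name `split_address` is hypothetical, the function returns the full region number rather than the 18-bit stored tag, and the storage estimate assumes the 16-K entries figure applies to each region buffer.

```python
REGION_BYTES = 1024                              # 1-KB regions (from the slide)
BLOCK_BYTES = 64                                 # 64-B blocks (from the slide)
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES  # 16 blocks per region


def split_address(addr: int) -> tuple:
    """Return (region number, block index within the region) for a byte address."""
    return addr // REGION_BYTES, (addr % REGION_BYTES) // BLOCK_BYTES


# Bits per entry, summed from the field widths listed on the slide.
REGION_DIR_ENTRY_BITS = 18 + 2 + 2                  # tag + state + one valid bit per cluster
REGION_BUF_ENTRY_BITS = 18 + 2 + BLOCKS_PER_REGION  # tag + state + one valid bit per block

# Assumption: the 16-K region entries figure describes each region buffer.
REGION_BUF_ENTRIES = 16 * 1024
region_buf_kib = REGION_BUF_ENTRIES * REGION_BUF_ENTRY_BITS / 8 / 1024

print(split_address(0x1234))
print(REGION_DIR_ENTRY_BITS, REGION_BUF_ENTRY_BITS, region_buf_kib)
```

Under these assumptions a region directory entry is 22 bits and a region buffer entry is 36 bits, so a 16-K-entry region buffer needs about 72 KiB of entry storage.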

HSC SUMMARY
- Key insight
  - GPU-CPU applications exhibit high spatial locality
  - Use direct-access bus present in systems
- Offload bandwidth onto direct-access bus
  - Use coherence network only for permission
- Add region buffer to track region information
  - At each L2 cache
  - Bypass coherence network and directory
- Replace directory with region directory
  - Significantly reduces total size needed
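The division of labor between region buffer, region directory, and direct-access bus can be sketched as a toy model. This is a simplified illustration under stated assumptions, not the protocol from the talk: the class and method names are hypothetical, and the sketch ignores region states, probes and invalidations of the other cluster, and evictions.

```python
from dataclasses import dataclass, field

REGION_BYTES = 1024  # 1-KB regions, as in the hardware-complexity slide


@dataclass
class RegionDirectory:
    """Toy region directory: records which clusters hold permission per region."""
    sharers: dict = field(default_factory=dict)  # region -> set of cluster names
    requests: int = 0                            # permission requests received

    def acquire(self, region: int, cluster: str) -> None:
        self.requests += 1
        self.sharers.setdefault(region, set()).add(cluster)


@dataclass
class RegionBuffer:
    """Toy region buffer sitting next to one cluster's L2 cache."""
    cluster: str
    directory: RegionDirectory
    regions: set = field(default_factory=set)  # regions with permission already held
    direct_accesses: int = 0                   # L2 misses served over the direct-access bus

    def l2_miss(self, addr: int) -> str:
        """Route an L2 miss: direct-access bus if region permission is held."""
        region = addr // REGION_BYTES
        if region in self.regions:
            self.direct_accesses += 1
            return "direct-access bus"
        # First touch of the region: ask the region directory for permission.
        self.directory.acquire(region, self.cluster)
        self.regions.add(region)
        return "region directory"


directory = RegionDirectory()
gpu = RegionBuffer("GPU", directory)
paths = [gpu.l2_miss(addr) for addr in range(0, 2 * REGION_BYTES, 64)]
print(directory.requests, gpu.direct_accesses)
```

Streaming through two full regions in this model produces 2 permission requests at the region directory and 30 misses on the direct-access bus, which is the spatial-locality effect the key insight relies on.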

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
- Results
  - Speed-up
  - Latency of loads
  - Bandwidth
  - MSHR usage
- Conclusions

THREE CACHE-COHERENCE PROTOCOLS
- Broadcast: null directory that broadcasts on all requests
- Baseline: block-based, mostly inclusive directory
- HSC: region-based directory with 1-KB region size

HSC PERFORMANCE
- Largest slowdowns come from constrained resources
[Figure: normalized speed-up (y-axis, 0 to 5) of Broadcast, Baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

DIRECTORY TRAFFIC REDUCTION
- Average bandwidth significantly reduced
- Theoretical reduction from 16-block regions
[Figure: normalized directory bandwidth (y-axis, 0 to 1.2) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
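The theoretical reduction from 16-block regions follows from simple arithmetic. The sketch below works it out for an idealized workload that touches every block of each region, which is an assumption made here for illustration; the measured per-benchmark values are the ones shown in the plot.

```python
REGION_BYTES = 1024
BLOCK_BYTES = 64
blocks_per_region = REGION_BYTES // BLOCK_BYTES  # 16

# Idealized case (assumption): every block of a region is touched, so one
# region-permission request replaces 16 block-level directory requests.
normalized_bandwidth = 1 / blocks_per_region
reduction_pct = (1 - normalized_bandwidth) * 100

print(f"normalized directory bandwidth: {normalized_bandwidth}")
print(f"reduction: {reduction_pct}%")
```

The idealized bound works out to 93.75%, which sits close to the 94% bandwidth reduction reported in the abstract and conclusions.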

HSC RESOURCE USAGE
- Maximum MSHRs significantly reduced
[Figure: normalized directory MSHRs required (y-axis, 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

RESULTS SUMMARY
- Used a detailed timing simulator for CPU and GPU
- HSC significantly improves performance
  - Reduces the average load latency
  - Decreases bandwidth requirement of directory
- HSC reduces the required MSHRs at the directory

RELATED WORK
- Coarse-grained coherence
  - Region coherence
    - Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
    - Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
  - Spatiotemporal coherence [Alisafaee, MICRO 2012]
  - Dual-grain directory coherence [Basu, UW-TR 2013]
  - Primarily focused on directory size
- GPU coherence [Singh et al., HPCA 2013]
  - Intra-GPU coherence

CONCLUSIONS
- Hardware coherence can increase the utility of heterogeneous systems
- Major bottlenecks in current coherence implementations
  - High bandwidth difficult to support at directory
  - Extreme resource requirements
- We propose Heterogeneous System Coherence
  - Leverages spatial locality and region coherence
  - Reduces bandwidth by 94%
  - Reduces resource requirements by 95%

Questions? Contact: powerjg@cs.wisc.edu

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

LOAD LATENCY
- Average load time significantly reduced
[Figure: normalized load latency (y-axis, 0 to 4.5) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]

EXECUTION TIME BREAKDOWN
[Figure: execution time (%) split between GPU and CPU (y-axis, 0 to 120) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]