HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD*
*University of Wisconsin-Madison; all other authors: Advanced Micro Devices, Inc.

PowerPoint version available at: http://pages.cs.wisc.edu/~powerjg/
HETEROGENEOUS SYSTEM COHERENCE, DECEMBER 11, 2013, MICRO-46

ABSTRACT
  Hardware coherence can increase the utility of heterogeneous systems
  Major bottlenecks in current coherence implementations
    High bandwidth difficult to support at directory
    Extreme resource requirements
  We propose Heterogeneous System Coherence
    Leverages spatial locality and region coherence
    Reduces bandwidth by 94%
    Reduces resource requirements by 95%

PHYSICAL INTEGRATION
  [Figure: physical integration of GPU, CPU cores, and stacked high-bandwidth DRAM. Credit: IBM]

LOGICAL INTEGRATION
  General-purpose GPU computing
    OpenCL
    CUDA
  Heterogeneous Uniform Memory Access (hUMA)
    Shared virtual address space
    Cache coherence
  Allows new heterogeneous apps

OUTLINE
  Motivation
  Background
    System overview
    Cache architecture reminder
  Heterogeneous System Bottlenecks
  Heterogeneous System Coherence Details
  Results
  Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL
  [Figure: an Accelerated Processing Unit (APU) connected to DRAM channels by a high-bandwidth interconnect]

SYSTEM OVERVIEW: APU
  [Figure: the APU contains a GPU cluster and a CPU cluster, both connected to the directory, which connects to DRAM. Labels: direct-access bus (used for graphics), invalidation traffic. Arrow thickness indicates bandwidth.]
  GPU compute accesses must stay coherent

SYSTEM OVERVIEW: GPU
  [Figure: the GPU cluster, made up of compute units (CUs), each with its own L1, sharing the GPU L2 cache. CU detail: I-fetch/decode, register file, execution units (Ex), local scratchpad memory, and a coalescer that feeds the L1.]
  Very high bandwidth: L2 has high miss rate

SYSTEM OVERVIEW: CPU
  [Figure: the CPU cluster, made up of CPU cores, each with its own L1, sharing an L2 that connects to the directory]
  Low bandwidth: low L2 miss rate

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE
  [Figure: L2 cache pipeline between the core and the coherent network interface. Labels: demand requests, data responses, cache tag arrays, MSHRs, MSHR entries, hit, miss, miss requests, probe requests.]
  Demand requests from L1
    Searches cache tags for a tag match
    Allocates an MSHR entry
    On a hit, return data to the L1
    On a miss, send request to directory
  On a directory probe, check MSHRs and tags
    Tag hit on probe: send data to other core
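The request flow on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulated hardware: the class `L2Cache`, its method names, and the dictionary-based tag array are hypothetical, merging of requests to an outstanding miss and invalidate-on-probe are simplifications not stated on the slide, and replacement, timing, and coherence states are omitted.

```python
class L2Cache:
    """Toy model of an L2 request flow: tag lookup, MSHR allocation, probes."""

    BLOCK_BYTES = 64  # 64-B blocks, as used elsewhere in the slides

    def __init__(self, num_mshrs):
        self.tags = {}    # block address -> data (stands in for tag + data arrays)
        self.mshrs = {}   # block address -> requesters waiting on the miss
        self.num_mshrs = num_mshrs

    def block_of(self, addr):
        # Align the byte address down to its block address.
        return addr - (addr % self.BLOCK_BYTES)

    def demand_request(self, addr, requester):
        """Handle a demand request from an L1; returns (action, payload)."""
        block = self.block_of(addr)
        if block in self.tags:                  # tag match: hit
            return ("data_to_l1", self.tags[block])
        if block in self.mshrs:                 # assumption: merge with outstanding miss
            self.mshrs[block].append(requester)
            return ("merged", None)
        if len(self.mshrs) >= self.num_mshrs:   # no free MSHR: back-pressure
            return ("stall", None)
        self.mshrs[block] = [requester]         # allocate an MSHR entry
        return ("miss_to_directory", block)

    def data_response(self, block, data):
        """Fill the cache, free the MSHR, and return the waiting requesters."""
        self.tags[block] = data
        return self.mshrs.pop(block, [])

    def probe(self, block):
        """Directory probe: check tags and MSHRs; a tag hit supplies the data.

        Assumption: every probe invalidates the local copy.
        """
        if block in self.tags:
            return ("data_to_requester", self.tags.pop(block))
        return ("pending" if block in self.mshrs else "not_present", None)
```

With a single MSHR, a second miss to a different block returns `"stall"`, which is the back-pressure effect the later bottleneck slides describe.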

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY
  [Figure: directory pipeline. Labels: coherent block requests, block directory tag array, MSHRs, MSHR entries, hit, miss, probe request RAM, PR entries, block probe requests/responses, to DRAM.]
  Demand requests from L2
    Searches cache tags for a tag match
    Allocates an MSHR entry
    On a miss, the data comes from DRAM
  Allocate and send probes to L2 caches
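The directory steps listed on this slide can be sketched similarly. This is an illustrative model only: `BlockDirectory`, its methods, and the returned action names are hypothetical; it assumes the requester becomes the sole sharer when a request completes, and it stalls both when MSHRs run out and when a request to the same block is already outstanding.

```python
class BlockDirectory:
    """Toy block-granularity directory with a bounded MSHR pool."""

    def __init__(self, num_mshrs):
        self.sharers = {}    # block address -> set of L2 caches holding it
        self.mshrs = set()   # blocks with a request in flight
        self.num_mshrs = num_mshrs

    def request(self, block, requester):
        """Returns (action, caches_to_probe)."""
        if block in self.mshrs or len(self.mshrs) >= self.num_mshrs:
            return ("stall", [])       # the MSHR is held for the whole request
        self.mshrs.add(block)          # allocate an MSHR entry
        others = sorted(self.sharers.get(block, set()) - {requester})
        if others:
            return ("probe", others)   # tag hit: send probes to the other L2s
        return ("read_dram", [])       # miss: the data comes from DRAM

    def complete(self, block, requester):
        """Free the MSHR; assumption: requester ends up as the only sharer."""
        self.mshrs.discard(block)
        self.sharers[block] = {requester}
```

Because an entry stays allocated from `request` until `complete`, the number of MSHRs bounds how many block requests can be in flight at once.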

BACKGROUND SUMMARY
  System under investigation
    Heterogeneous CPU-GPU on chip
    High-bandwidth DRAM
  Directory pipeline complex
    MSHR array is associative
    Difficult to pipeline with more than 1 request per cycle
  Important resources: MSHR entries

OUTLINE
  Motivation
  Background
  Heterogeneous System Bottlenecks
    Simulation overview
    Directory bandwidth
    MSHRs
    Performance is significantly affected
  Heterogeneous System Coherence Details
  Results
  Conclusions

SIMULATION DETAILS
  gem5 simulator
    Simple CPU
    GPU simulator based on AMD GCN
    All memory requests through gem5
  Workloads
    Modified to use hUMA
    Rodinia & AMD APP SDK

  Parameter            Value
  CPU Clock            2 GHz
  CPU Cores            2
  CPU Shared L2        2 MB (16-way banked)
  GPU Clock            1 GHz
  Compute Units        32
  GPU Shared L2        4 MB (64-way banked)
  L3 (Memory-side)     16 MB (16-way banked)
  DRAM                 DDR3, 16 channels
  Peak Bandwidth       700 GB/s
  Baseline Directory   256k entries (8-way banked)

GPGPU BENCHMARKS
  Rodinia benchmarks
    bp: trains the connection weights on a neural network
    bfs: breadth-first search
    hs: performs a transient 2D thermal simulation (5-point stencil)
    lud: matrix decomposition
    nw: performs a global optimization for DNA sequence alignment
    km: does k-means clustering
    sd: speckle-reducing anisotropic diffusion
  AMD SDK
    bn: bitonic sort
    dct: discrete cosine transform
    hg: histogram
    mm: matrix multiplication

SYSTEM BOTTLENECKS
  Difficult to scale directory bandwidth
    Difficult to multi-port
    Complicated pipeline
  High resource usage
    Must allocate MSHR for entire duration of request
    MSHR array difficult to scale
  [Figure: APU with GPU cluster and CPU cluster connected through the directory to DRAM. Callouts: "High bandwidth"; "Designed to support CPU bandwidth".]

DIRECTORY TRAFFIC
  [Figure: directory accesses per GPU cycle (y-axis, 0 to 4.5) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
  Difficult to support >1 request per cycle

RESOURCE USAGE
  [Figure: maximum MSHRs (y-axis, log scale, 1 to 100,000) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm. Reference line: steady state at 700 GB/s.]
  Very difficult to scale MSHR array
  Causes significant back-pressure on L2s
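One way to read the "steady state at 700 GB/s" reference is through Little's law: the number of occupied MSHRs equals the request arrival rate times the time each request holds its entry. The sketch below uses the 700 GB/s peak bandwidth, 64-B blocks, and 1 GHz GPU clock from the slides. The hold times are assumed values for illustration only, as is treating 1 GB as 10^9 bytes; the slides do not state how the reference line was computed.

```python
def blocks_per_second(bandwidth_bytes_per_s, block_bytes=64):
    """Block requests per second needed to sustain the given bandwidth."""
    return bandwidth_bytes_per_s / block_bytes


def steady_state_mshrs(bandwidth_bytes_per_s, hold_time_s, block_bytes=64):
    """Little's law: occupancy = arrival rate * residence time."""
    return blocks_per_second(bandwidth_bytes_per_s, block_bytes) * hold_time_s


PEAK_BW = 700e9        # bytes/s, from the slides (assuming 1 GB = 1e9 bytes)
GPU_CLOCK_HZ = 1e9     # 1 GHz GPU clock, from the slides

# Block requests that must be handled per GPU cycle at peak bandwidth.
blocks_per_gpu_cycle = blocks_per_second(PEAK_BW) / GPU_CLOCK_HZ

if __name__ == "__main__":
    print(f"blocks per GPU cycle at peak: {blocks_per_gpu_cycle:.2f}")
    for hold_ns in (50, 100, 200):   # assumed MSHR hold times, not from the slides
        mshrs = steady_state_mshrs(PEAK_BW, hold_ns * 1e-9)
        print(f"hold time {hold_ns} ns -> about {mshrs:.0f} MSHRs")
```

The occupancy grows linearly with the hold time, so an MSHR that must stay allocated for the entire request is the costly part.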

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
  [Figure: slow-down (y-axis, 0 to 5) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
  Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
  Directory bandwidth
    Must support up to 4 requests per cycle
    Difficult to construct pipeline
  Resource usage
    MSHRs are a constraining resource
    Need more than 10,000
  Without resource constraints, up to 4x better performance

OUTLINE
  Motivation
  Background
  Heterogeneous System Bottlenecks
  Heterogeneous System Coherence Details
    Overall system design
    Region buffer design
    Region directory design
    Example
    Hardware complexity
  Results
  Conclusions

BASELINE DIRECTORY COHERENCE
  [Figure: APU with GPU cluster and CPU cluster connected through the directory to DRAM. Steps shown: initialization, kernel launch, read result.]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
  [Figure: APU with GPU cluster and CPU cluster connected through the directory to DRAM. Steps shown: initialization, kernel launch.]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
  [Figure: APU in which the GPU cluster and the CPU cluster each have a region buffer, and a region directory takes the place of the directory. Labels: direct-access bus, to DRAM.]
  Region buffers coordinate with region directory

HSC: EXAMPLE MEMORY REQUEST
  [Figure: example memory request in the APU. Labels: GPU cluster, GPU L2 cache, GPU region buffer, CPU cluster, CPU region buffer, region directory, to DRAM.]

HSC: L2 CACHE & REGION BUFFER
  [Figure: the L2 cache pipeline (cache tag arrays, MSHRs, MSHR entries, hit, miss) alongside the region buffer. Labels: demand requests, data responses, probe requests, miss requests, direct-access bus interface, coherent network interface.]
  Region buffer: region tags and permissions
  Coherent network: only region-level permission traffic
  Interface for direct-access bus

HSC: REGION DIRECTORY
  [Figure: the block directory (coherent block requests, block directory tag array, block probe requests/responses) becomes a region directory (region permission requests, region directory tag array). Both have MSHRs, MSHR entries, hit, miss, probe request RAM, PR entries, and a path to DRAM.]
  Region directory tag array: region tags, sharers, and permissions

HSC: HARDWARE COMPLEXITY
  Region protocols reduce directory size
    Region directory: 8x fewer entries
  Region buffers
    At each L2 cache
    1-KB region (16 64-B blocks)
    16-K region entries
    Overprovisioned for low-locality workloads

  (a) Region directory entry
    Region tag: 18 bits
    State: 2 bits
    CPU, GPU: 1 valid bit per cluster
  (b) Region buffer entry
    Region tag: 18 bits
    State: 2 bits
    B0, B1, B2, ..., B15: 1 valid bit per block in the region
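The entry layouts on this slide give the per-entry widths directly. The arithmetic below is a rough sketch: the constant and function names are hypothetical, "16-K" is taken to mean 16 x 1024 entries per region buffer, and the storage estimate counts only the fields listed on the slide (no ECC, replacement state, or other overhead).

```python
REGION_BYTES = 1024                 # 1-KB region
BLOCK_BYTES = 64                    # 64-B block
REGION_TAG_BITS = 18                # from the entry diagrams
STATE_BITS = 2                      # from the entry diagrams
NUM_CLUSTERS = 2                    # CPU and GPU: one valid bit each
REGION_BUFFER_ENTRIES = 16 * 1024   # assumption: "16-K" = 16 * 1024

blocks_per_region = REGION_BYTES // BLOCK_BYTES

# (a) region directory entry: tag + state + one valid bit per cluster
region_directory_entry_bits = REGION_TAG_BITS + STATE_BITS + NUM_CLUSTERS

# (b) region buffer entry: tag + state + one valid bit per block in the region
region_buffer_entry_bits = REGION_TAG_BITS + STATE_BITS + blocks_per_region

# Storage for one region buffer, counting only the listed fields.
region_buffer_bytes = REGION_BUFFER_ENTRIES * region_buffer_entry_bits // 8


def split_address(addr):
    """Split a byte address into (region number, block index within region)."""
    region = addr // REGION_BYTES
    block_in_region = (addr % REGION_BYTES) // BLOCK_BYTES
    return region, block_in_region


if __name__ == "__main__":
    print("blocks per region:", blocks_per_region)
    print("region directory entry bits:", region_directory_entry_bits)
    print("region buffer entry bits:", region_buffer_entry_bits)
    print("region buffer storage (bytes):", region_buffer_bytes)
```

The block index returned by `split_address` selects which of the per-block valid bits B0 to B15 applies to an access.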

HSC SUMMARY
  Key insight
    GPU-CPU applications exhibit high spatial locality
  Use direct-access bus present in systems
    Offload bandwidth onto direct-access bus
    Use coherence network only for permission
  Add region buffer to track region information
    At each L2 cache
    Bypass coherence network and directory
  Replace directory with region directory
    Significantly reduces total size needed
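The routing decision summarized here can be sketched as follows. This is a simplified model under loud assumptions: `RegionBuffer` and its methods are hypothetical names, permission is assumed to be granted immediately on request, the buffer has unlimited capacity, and every L2 miss fetches its data over the direct-access bus once the region permission is held.

```python
REGION_BYTES = 1024   # 1-KB regions, from the slides
BLOCK_BYTES = 64      # 64-B blocks, from the slides


class RegionBuffer:
    """Toy region buffer that counts traffic on each path."""

    def __init__(self):
        self.permitted = set()         # regions this cluster may access
        self.permission_requests = 0   # region-level requests on the coherence network
        self.direct_accesses = 0       # block fetches on the direct-access bus

    def l2_miss(self, addr):
        """Route one L2 miss and return the region it falls in."""
        region = addr // REGION_BYTES
        if region not in self.permitted:
            # Region miss: ask the region directory for permission.
            self.permission_requests += 1
            self.permitted.add(region)   # assumption: granted immediately
        # Region hit (or just granted): bypass the coherence network.
        self.direct_accesses += 1
        return region

    def revoke(self, region):
        """Drop permission, e.g. when the other cluster requests the region."""
        self.permitted.discard(region)


if __name__ == "__main__":
    rb = RegionBuffer()
    for addr in range(0, 64 * 1024, BLOCK_BYTES):   # stream through 64 KB
        rb.l2_miss(addr)
    print(rb.direct_accesses, "block fetches,",
          rb.permission_requests, "permission requests")
```

For a streaming access pattern that touches every block of each region, only one coherence-network request is made per 16 block fetches, which is the spatial-locality effect the key insight relies on.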

OUTLINE
  Motivation
  Background
  Heterogeneous System Bottlenecks
  Heterogeneous System Coherence Details
  Results
    Speed-up
    Latency of loads
    Bandwidth
    MSHR usage
  Conclusions

THREE CACHE-COHERENCE PROTOCOLS
  Broadcast: null directory that broadcasts on all requests
  Baseline: block-based, mostly inclusive directory
  HSC: region-based directory with 1-KB region size

HSC PERFORMANCE
  [Figure: normalized speed-up (y-axis, 0 to 5) of Broadcast, Baseline, and HSC for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
  Largest slow-downs from constrained resources

DIRECTORY TRAFFIC REDUCTION
  [Figure: normalized directory bandwidth (y-axis, 0 to 1.2) of broadcast, baseline, and HSC for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm. Reference line: theoretical reduction from 16-block regions.]
  Average bandwidth significantly reduced
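The theoretical reduction from 16-block regions can be worked out as a simple bound. It assumes the best case in which every block of a 1-KB region is accessed and a single region-permission request replaces one directory access per 64-B block; real workloads with lower spatial locality would fall short of it.

```latex
\[
\frac{\text{directory accesses with regions}}{\text{directory accesses with blocks}}
\;\ge\; \frac{64\,\text{B}}{1\,\text{KB}} \;=\; \frac{1}{16},
\qquad
\text{reduction} \;\le\; 1 - \frac{1}{16} \;=\; 93.75\%
\]
```

This bound rounds to the 94% bandwidth reduction quoted in the abstract, although the slides do not say whether that figure is the bound itself or a measured average.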

HSC RESOURCE USAGE
  [Figure: normalized directory MSHRs required (y-axis, 0 to 0.25) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
  Maximum MSHRs significantly reduced

RESULTS SUMMARY
  Used a detailed timing simulator for CPU and GPU
  HSC significantly improves performance
    Reduces the average load latency
    Decreases bandwidth requirement of directory
  HSC reduces the required MSHRs at the directory

RELATED WORK
  Coarse-grained coherence
    Region coherence
      Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
      Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
    Spatiotemporal coherence [Alisafaee, MICRO 2012]
    Dual-grain directory coherence [Basu, UW-TR 2013]
    Primarily focused on directory size
  GPU coherence [Singh et al., HPCA 2013]
    Intra-GPU coherence

CONCLUSIONS
  Hardware coherence can increase the utility of heterogeneous systems
  Major bottlenecks in current coherence implementations
    High bandwidth difficult to support at directory
    Extreme resource requirements
  We propose Heterogeneous System Coherence
    Leverages spatial locality and region coherence
    Reduces bandwidth by 94%
    Reduces resource requirements by 95%

Questions? Contact: powerjg@cs.wisc.edu

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

LOAD LATENCY
  [Figure: normalized load latency (y-axis, 0 to 4.5) of broadcast, baseline, and HSC for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
  Average load time significantly reduced

EXECUTION TIME BREAKDOWN
  [Figure: execution time (%) split between GPU and CPU (y-axis, 0 to 120) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]