Scalable Cache Miss Handling For High MLP

Scalable Cache Miss Handling for High MLP
James Tuck, Luis Ceze, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Introduction
- Checkpointed processors are promising superscalar architectures: Runahead, CPR, Out-of-order commit, CFP, CAVA
- They deliver high numbers of in-flight instructions, effectively hide long memory latencies, and dramatically increase Memory-Level Parallelism (MLP)
- Current miss handling structures are woefully under-designed!
2 of 25

Miss Handling Architecture (MHA)
- Kroft, ISCA '81; Scheurich & Dubois, SC '88; Farkas & Jouppi, ISCA '94
- MSHR = Miss Information/Status Holding Registers
[Diagram: on a cache miss, the MHA in the cache hierarchy records the miss. The first (primary) miss to a line allocates an MSHR entry; further (secondary) misses to the same line allocate subentries. Each subentry records the destination register in the processor, the block offset, the type (rd/wr), and the data (or a pointer).]
3 of 25
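As a rough illustration of the entry/subentry structure described above, the following Python sketch models an MSHR file: one entry per in-flight line (primary miss), with subentries for secondary misses to the same line. All class names, field names, and sizes are illustrative assumptions, not from the talk.

```python
# Minimal sketch of an MSHR (Miss Information/Status Holding Registers) file.
# Names and default sizes are illustrative assumptions, not from the talk.

class Subentry:
    """State for one miss to an in-flight line."""
    def __init__(self, dest_reg, block_offset, is_write, data=None):
        self.dest_reg = dest_reg          # destination register in the processor
        self.block_offset = block_offset  # offset of the word within the line
        self.is_write = is_write          # type (rd/wr)
        self.data = data                  # data (or pointer) for writes

class MSHRFile:
    def __init__(self, num_entries=8, subentries_per_entry=4):
        self.num_entries = num_entries
        self.subentries_per_entry = subentries_per_entry
        self.entries = {}                 # line address -> list of Subentry

    def handle_miss(self, line_addr, dest_reg, offset, is_write):
        """Returns 'primary', 'secondary', or 'lockup' (no free entry/subentry)."""
        if line_addr in self.entries:
            subs = self.entries[line_addr]
            if len(subs) >= self.subentries_per_entry:
                return "lockup"           # no free subentry: cache must stall
            subs.append(Subentry(dest_reg, offset, is_write))
            return "secondary"
        if len(self.entries) >= self.num_entries:
            return "lockup"               # no free entry: cache must stall
        self.entries[line_addr] = [Subentry(dest_reg, offset, is_write)]
        return "primary"

    def fill(self, line_addr):
        """Line returned from memory: free the entry and return its subentries."""
        return self.entries.pop(line_addr, [])
```

Two misses to the same line thus consume one entry but two subentries; the `lockup` outcomes are exactly the capacity pressures the talk quantifies.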

Background on MHA
- Kroft [ISCA '81] proposed the first non-blocking cache: a Unified MHA, with a single MSHR file for the whole cache
- Sohi and Franklin [ISCA '91] evaluated cache bandwidth: a Banked MHA, with the MSHR file banked alongside the cache
4 of 25

Motivation
- MHAs must support many more misses; a brute-force approach will not do
- The Unified (centralized) design has low bandwidth
- The Banked design may cause access imbalance (and lockup, inducing processor stalls) or inefficient area usage
5 of 25

Proposal: Hierarchical MHA
- A small per-bank Dedicated MSHR file with a Bloom filter: high bandwidth
- A larger, Shared MSHR file: high effective capacity, low lock-up time
6 of 25

Contributions
- Show that state-of-the-art designs are a significant bottleneck
- Propose a Hierarchical MHA to meet high-MLP demands
- Thoroughly evaluate on Checkpointed processors with SMT and show:
  - Over state-of-the-art, average speedups of 32% to 95%
  - Over a large Unified design, average speedups of 1% to 18%
  - Performance close to an unlimited-size MHA
7 of 25

Why not reuse load/store queue state?
- High MLP requires state in both the LSQ and the MHA
- Could simplify the MHA by leveraging the complex LSQ: allocate on a primary miss and keep all secondary-miss state in the LSQ
- Disadvantages of leveraging the LSQ:
  - It induces additional global searches in the LSQ from the cache side; searches would use an ID or line address, not a word address
  - Some checkpointed microarchitectures speculatively retire instructions and discard LSQ state
  - The LSQ is timing-critical: better not to put restrictions on it
- We keep primary- and secondary-miss information in the MHA and rely on no specific LSQ design
8 of 25

Outline
- Requirements of new MHAs
- Hierarchical MHA
- Experimental setup and evaluation
9 of 25

Requirements for the new MHAs
- High capacity
[Chart comparing Checkpointed and Conventional processors]
10 of 25

Requirements for the new MHAs
- High capacity
- High bandwidth: average increase of 30%
11 of 25

Requirements for the new MHAs
- High capacity
- High bandwidth: average increase of 30%
- Banked MHAs may suffer from access-imbalance lockups: 15% to 23% slowdown
- Many entries and subentries: 32 entries (primary misses), 16 to 32 subentries (secondary misses)
These are our design goals
12 of 25

Outline
- Requirements of new MHAs
- Hierarchical MHA
- Experimental setup and evaluation
13 of 25

Hierarchical MHA
- A miss allocates an entry in the per-bank Dedicated file; a secondary miss will often hit there
- When the Dedicated file is full, an entry is displaced to the Shared file and recorded in its Bloom filter
- The Bloom filter averts unnecessary Shared-file accesses
14 of 25
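The allocate-in-Dedicated, displace-to-Shared flow described on this slide can be sketched in Python as follows. This is a self-contained behavioral model under stated assumptions: the structure sizes, the hash functions, the displacement victim choice, and all names are illustrative, and only one bank is shown.

```python
# Sketch of the Hierarchical MHA flow: a small per-bank Dedicated file
# backed by a large Shared file that is guarded by a Bloom filter.
# Sizes, hashes, and the victim-selection policy are illustrative assumptions.

class SimpleBloomFilter:
    def __init__(self, num_bits=64):
        self.bits = [False] * num_bits
    def _positions(self, key):
        return (hash(key) % len(self.bits), hash((key, 1)) % len(self.bits))
    def insert(self, key):
        for p in self._positions(key):
            self.bits[p] = True
    def may_contain(self, key):  # no false negatives; false positives possible
        return all(self.bits[p] for p in self._positions(key))

class HierarchicalMHA:
    def __init__(self, dedicated_entries=4):
        self.dedicated = {}                  # per-bank Dedicated file (one bank shown)
        self.dedicated_entries = dedicated_entries
        self.shared = {}                     # large Shared file
        self.bloom = SimpleBloomFilter()     # summarizes Shared-file contents

    def handle_miss(self, line_addr):
        if line_addr in self.dedicated:      # secondary misses often hit here
            self.dedicated[line_addr] += 1
            return "dedicated-hit"
        # The Bloom filter averts most useless Shared-file accesses.
        if self.bloom.may_contain(line_addr) and line_addr in self.shared:
            self.shared[line_addr] += 1
            return "shared-hit"
        if len(self.dedicated) >= self.dedicated_entries:
            # Dedicated file is full: displace an entry to the Shared file.
            victim, count = self.dedicated.popitem()
            self.shared[victim] = count
            self.bloom.insert(victim)
        self.dedicated[line_addr] = 1        # allocate the miss in the Dedicated file
        return "dedicated-alloc"
```

In this model, the common case (a hit or allocation in the small Dedicated file) never touches the Shared file, which is the source of the design's bandwidth advantage.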

Hierarchical meets design goals
- High effective capacity with infrequent lock-up, while using MHA area efficiently: the Shared file absorbs displacements
- High bandwidth: allocation happens in the per-bank Dedicated file, and locality ensures most accesses hit there
- The Bloom filter for the Shared file averts most useless accesses to it, preventing a bottleneck at the Shared file
15 of 25

Overall organization and timing
- Dedicated file: small and fully pipelined; few entries and subentries; one per bank
- Bloom filter: accessed in parallel with the Dedicated file; no false negatives
- Shared file: highly associative and unpipelined; contains many entries and subentries
16 of 25
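The "no false negatives" property is what makes the parallel Bloom-filter check safe: a negative answer proves the line is not in the Shared file, so the slow Shared-file access can be skipped without ever missing real state. A minimal sketch of that property, with illustrative bit-array size and hash functions:

```python
# Why "no false negatives" matters: a negative Bloom-filter answer proves
# absence, so the unpipelined Shared-file lookup can be safely skipped.
# Bit-array size and hash functions are illustrative assumptions.

class BloomFilter:
    def __init__(self, num_bits=128, num_hashes=2):
        self.bits = 0
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        return [hash((key, i)) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def may_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits & (1 << p) for p in self._positions(key))

def shared_file_lookup(bloom, shared, line_addr):
    """Consult the Shared file only when the Bloom filter says 'maybe'."""
    if not bloom.may_contain(line_addr):
        return None          # guaranteed miss: Shared-file access averted
    return shared.get(line_addr)
```

Because every inserted key sets its bits and bits are never cleared in this sketch, an inserted key always tests positive; only non-inserted keys can (occasionally) test positive by accident.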

Outline
- Requirements of new MHAs
- Hierarchical MHA
- Experimental setup and evaluation
17 of 25

Experimental setup
- 5 GHz processor; 5-issue, SMT with 2 contexts
- Processor models: Conventional, Checkpointed, LargeWindow (2K-entry ROB)
- 32 KB data cache: 8 banks, 2-way, 64 B lines, 3-cycle access, 1 port
- Memory bus bandwidth: 15 GB/s
- Workloads: CINT, CFP, Mix
- SESC simulator (sesc.sourceforge.net)
18 of 25

Compare MHAs with the same area
- 8%, 15%, and 25% of the cache area; area estimated using CACTI 4.1; the MSHR structures are fully associative
- Unified, Banked, and Hierarchical designs at each area point
- Current: 8 outstanding misses, like the Pentium 4
19 of 25

Performance at 15% area for Checkpointed
- Current is much worse
- Hierarchical is better than Unified and Banked: 1% to 18% over Unified, 10% to 21% over Banked
- Hierarchical is very close to Unlimited
20 of 25

Performance at 15% area for other processors
- Conventional: less gain across the board
- LargeWindow: Current bottlenecks the processor; Hierarchical outperforms the rest
- Other architectures can leverage this design
21 of 25

Performance at different area points
- Speedup over Banked-15%, for Checkpointed running Mixes
- Unified saturates at 15%
- Banked continues to improve as it scales up
- Hierarchical is the most efficient at these areas
22 of 25

Characterization
- The Bloom filter averts the majority of Shared-file accesses: on average, 89% to 95%
- Most secondary misses hit in the Dedicated file
- Reasons for displacing an entry from the Dedicated file:
  - No free subentries: 18% to 40%
  - No free entries: 60% to 82%
23 of 25

Conclusions
- State-of-the-art MHA designs are a large bottleneck: Hierarchical speeds up 32% to 95% over state-of-the-art
- Brute-force Unified and Banked designs are suboptimal: Hierarchical speeds up 1% to 18% over Unified and 10% to 21% over Banked
- Hierarchical performs best over a range of areas
- The additional complexity of Hierarchical is reasonable
24 of 25

Questions?
Scalable Cache Miss Handling for High MLP
James Tuck, Luis Ceze, and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
25 of 25