Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi




Transcription:

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi. University of Utah & HP Labs.

Large Caches. Cache hierarchies will dominate chip area; 3D stacked processors with an entire die for on-chip cache could be common. Example: Intel Montecito has two private 12 MB L3 caches (27 MB including L2). Long global wires are required to transmit data/address.

Wire Delay/Power. Wire delays are costly for performance and power: latencies of 60 cycles to reach the ends of a chip at 32 nm (@ 5 GHz), and 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04). CACTI (version 4) access time for a 24 MB cache is 90 cycles @ 5 GHz at 65 nm technology.

Contribution. Support for various interconnect models. Improved design space exploration. Support for modeling Non-Uniform Cache Access (NUCA).

Cache Design Basics. [Slide diagram: the input address feeds a decoder that drives wordlines across the tag and data arrays; bitlines feed column muxes and sense amps; comparators check the tag and drive the valid-output and mux drivers, and output drivers produce the data output.]

Existing Model - CACTI. [Slide diagram: cache models with 4 and 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay.] Decoder delay = H-tree delay + logic delay.
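The decomposition on this slide (decoder delay = H-tree delay + logic delay) can be written as a toy sum-of-components model. All component values below are illustrative assumptions, not CACTI's actual numbers; they only show the tension between sub-array count and H-tree length:

```python
def cache_access_time(htree_delay_ps, decoder_logic_ps, wordline_ps,
                      bitline_ps, senseamp_ps, output_ps):
    """Toy CACTI-style access-time model. The H-tree that distributes the
    address to the sub-arrays is folded into the decoder term, so
    decoder delay = H-tree delay + logic delay."""
    decoder_ps = htree_delay_ps + decoder_logic_ps
    return decoder_ps + wordline_ps + bitline_ps + senseamp_ps + output_ps

# Illustrative numbers: more sub-arrays shrink the wordline/bitline delays
# (shorter lines) but lengthen the H-tree that must reach every sub-array.
t4 = cache_access_time(htree_delay_ps=300, decoder_logic_ps=100,
                       wordline_ps=200, bitline_ps=250,
                       senseamp_ps=50, output_ps=60)
t16 = cache_access_time(htree_delay_ps=500, decoder_logic_ps=100,
                        wordline_ps=100, bitline_ps=120,
                        senseamp_ps=50, output_ps=60)
```

With these assumed values the 16-sub-array design wins slightly, but the growing H-tree term is exactly the overhead the next slide quantifies.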

Power/Delay Overhead of Wires. H-tree delay increases with cache size, and H-tree power continues to dominate; bitlines are the other major contributor to total power. [Chart: H-tree delay and power percentages (0-70%) versus cache size (2-32 MB).]

Motivation. The dominant role of interconnect is clear, and the lack of a tool to model interconnect in detail can impede progress. Current solutions (Orion, CACTI) have limited wire options: a weak wire model and no support for modeling multi-megabyte caches.

CACTI 6.0 Enhancements. Incorporation of different wire models, different router models, a grid topology for NUCA, a shared bus for UCA, and contention values for various cache configurations. Methodology to compute the optimal NUCA organization. Improved interface that enables trade-off analysis. Validation analysis.

Full-swing Wires. [Slide diagram of a repeated full-swing wire.]

Full-swing Wires II. [Chart: three different design points, at 10%, 20%, and 30% delay penalty, versus repeater size.] Caveat: repeater sizing and spacing cannot be controlled precisely all the time.

Full-Swing Wires. + Fast and simple: delay proportional to sqrt(RC), as against RC. + High bandwidth; can be pipelined. - Requires silicon area. - High energy: quadratic dependence on voltage.
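The "delay proportional to sqrt(RC) as against RC" point can be illustrated numerically: an unrepeated wire's distributed-RC delay grows quadratically with length, while splitting it into optimally sized repeated segments makes delay close to linear. The per-mm resistance/capacitance and the fixed repeater delay below are assumed round numbers, not process data:

```python
import math

def wire_delay_ps(length_mm, n_segments, r_ohm_per_mm=100.0,
                  c_ff_per_mm=200.0, repeater_delay_ps=20.0):
    """Delay of a wire split into n equal repeated segments: each segment
    contributes its distributed Elmore delay 0.38*R*C plus one fixed
    repeater delay. (ohm * fF = 1e-3 ps, hence the unit factor.)"""
    seg_mm = length_mm / n_segments
    seg_rc_ps = 0.38 * (r_ohm_per_mm * seg_mm) * (c_ff_per_mm * seg_mm) * 1e-3
    return n_segments * (seg_rc_ps + repeater_delay_ps)

# Unrepeated delay grows quadratically with length (4x length -> 16x delay)...
unrepeated = [wire_delay_ps(L, 1, repeater_delay_ps=0.0) for L in (1, 2, 4)]
# ...while the best repeated design (searching over segment counts) is
# markedly faster for long wires and scales far more gently.
repeated = [min(wire_delay_ps(L, n) for n in range(1, 64)) for L in (1, 2, 4)]
```

The repeaters cost silicon area and switching energy, which is exactly the trade-off the slide's minus points record.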

Low-swing Wires. [Slide diagram: differential wires around a 400 mV level, with a 50 mV raise on one wire and a 50 mV drop on the other.]

Differential Low-swing. + Very low power; can be routed over other modules. - Relatively slow, low bandwidth, high area requirement; requires a special transmitter and receiver. Bitlines are a form of low-swing wire, but optimized for speed and area as against power: the driver and pre-charger employ the full Vdd voltage.
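The low-power claim follows from the usual dynamic-energy relation E = C · Vdd · Vswing, which reduces to the familiar C · Vdd² when the swing equals the rail. A quick sketch using an assumed wire capacitance and the previous slide's 400 mV / 50 mV figures:

```python
def dynamic_energy_fj(c_ff, vdd, vswing):
    """Energy drawn from the supply rail vdd to swing a capacitance c_ff
    through vswing: E = C * Vdd * Vswing (in fJ for fF and V inputs).
    Full-swing signaling has vswing == vdd, giving C * Vdd**2."""
    return c_ff * vdd * vswing

c_wire_ff = 800.0  # assumed capacitance of a long on-chip wire (illustrative)

full_swing = dynamic_energy_fj(c_wire_ff, vdd=1.0, vswing=1.0)
low_swing = dynamic_energy_fj(c_wire_ff, vdd=0.4, vswing=0.05)

savings = full_swing / low_swing  # roughly an order-of-magnitude-plus win
```

The gain comes from both factors at once, reduced rail and reduced swing, which is why low-swing wires are so attractive despite their speed and area drawbacks.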

Delay Characteristics. [Chart showing a quadratic increase in delay.]

Energy Characteristics. [Chart.]

Search Space of CACTI 5. [Chart: design space with global wires optimized for delay.]

Search Space of CACTI 6. [Chart: design space with global and low-swing wires, marking the low-swing, 30% delay penalty, and least-delay design points.]

CACTI: Another Limitation. Access delay is equal to the delay of the slowest sub-array, giving a very high hit time for large caches. Potential solution: NUCA, so extend CACTI to model it. CACTI currently employs a separate bus for each cache bank in multi-banked caches, which is not scalable; instead, exploit different wire types and network design choices to improve the search space.

Non-Uniform Cache Access (NUCA) (Kim et al., ASPLOS '02). A large cache is broken into a number of small banks and employs an on-chip network for communication. Access delay ∝ distance between the bank and the cache controller. [Slide diagram: CPU & L1 connected to a grid of cache banks.]

Extension to CACTI. On-chip network: wire model based on ITRS 2005 parameters; grid network; 3-stage speculative router pipeline. Network latency vs. bank access latency trade-off: iterate over different bank sizes, calculate the average network delay based on the number of banks and bank sizes, and consider contention values for different cache configurations. Similarly, we also consider the power consumed by each organization.
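The bank-count iteration described here can be sketched as a toy model: average grid hops times per-hop (router + link) delay gives network latency, which trades off against bank access latency since larger banks are slower. The router, link, and bank-latency constants are illustrative assumptions; the real tool derives them from its circuit models and adds contention:

```python
import math

def avg_hops_grid(n_banks):
    """Average Manhattan distance from a corner cache controller to a bank
    in a roughly square grid, assuming uniform access to all banks."""
    cols = int(math.ceil(math.sqrt(n_banks)))
    rows = int(math.ceil(n_banks / cols))
    coords = [(x, y) for y in range(rows) for x in range(cols)][:n_banks]
    return sum(x + y for x, y in coords) / n_banks

def nuca_latency(total_mb, n_banks, router_cycles=3, link_cycles=1,
                 bank_cycles_per_mb=4):
    """Toy NUCA access latency: routers and links along the average path
    (3-stage router pipeline plus a link per hop), plus the access latency
    of one bank, which grows with bank capacity."""
    network = avg_hops_grid(n_banks) * (router_cycles + link_cycles)
    bank = bank_cycles_per_mb * (total_mb / n_banks)
    return network + bank

# Sweep bank counts for a 32 MB cache: few banks mean slow banks, many
# banks mean a long network path; some intermediate count is optimal.
latencies = {n: nuca_latency(32, n) for n in (2, 4, 8, 16, 32, 64)}
best = min(latencies, key=latencies.get)
```

The U-shaped curve this sweep produces is the same shape as the trade-off chart on the next slide, before contention is layered on top.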

Trade-off Analysis (32 MB Cache, 16-core CMP). [Chart: total number of cycles, network latency, bank access latency, and network contention cycles (0-400) versus number of banks (2-64).]

Effect of Core Count. [Chart: contention cycles (0-300) versus bank count (2-64) for 4-core, 8-core, and 16-core configurations.]

Power-Centric Design (32 MB Cache). [Chart: total energy, bank energy, and network energy (0 to 1e-8 J) versus bank count (2-64), with the power-optimal point marked.]
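The power-optimal point comes from the same style of sweep on the energy side: bank access energy shrinks as banks get smaller, while network energy grows with the average hop count, which scales roughly as the square root of the bank count in a grid. All constants below are illustrative assumptions, not CACTI outputs:

```python
import math

def nuca_energy_nj(total_mb, n_banks, bank_nj_per_mb=0.3, hop_nj=0.07):
    """Toy per-access NUCA energy: one bank access (energy scales with
    bank capacity) plus router/link energy along an average path of
    about sqrt(n_banks) hops in a grid."""
    bank = bank_nj_per_mb * (total_mb / n_banks)
    network = hop_nj * math.sqrt(n_banks)
    return bank + network

# Sweep bank counts for a 32 MB cache; the minimum of the sum is the
# power-optimal organization, analogous to the marked point on the chart.
energies = {n: nuca_energy_nj(32, n) for n in (2, 4, 8, 16, 32, 64)}
power_optimal = min(energies, key=energies.get)
```

Note that with these assumed constants the energy-optimal bank count differs from the latency-optimal one, which is why the tool exposes both sweeps for trade-off analysis.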

Validation. An analytical model that employs Predictive Technology Model (65 nm) parameters is compared against HSPICE: distributed wordlines, bitlines, low-swing transmitters, wires, and receivers are verified to be within 12%.

Case Study: Heterogeneous D-NUCA. Dynamic NUCA reduces access time by dynamic data movement, and nearby banks are accessed more frequently. Heterogeneous banks: nearby banks are made smaller and hence faster, and accesses to them consume less power; other banks can be made larger and more power-efficient.

Access Frequency. [Chart: percentage of requests satisfied by the first x KB of cache (0-100%), with cache capacity on the x-axis.]

A Few Heterogeneous Organizations Considered by CACTI. [Slide diagrams: Model 1 and Model 2.]

Other Applications. Exposing wire properties: novel cache pipelining with early lookup and aggressive lookup (ISCA '07), and flit-reservation flow control (Peh et al., HPCA '00). Novel topologies: hybrid network (ISCA '07).

Conclusion. Network parameters and contention play a critical role in deciding the NUCA organization, and wire choices have a significant impact on cache properties. CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%. http://www.hpl.hp.com/personal/norman_jouppi/cacti6.html http://www.cs.utah.edu/~rajeev/cacti6/