In-Network Monitoring and Control Policy for DVFS of CMP Networks-on-Chip and Last Level Caches


In-Network Monitoring and Control Policy for DVFS of CMP Networks-on-Chip and Last Level Caches
Xi Chen 1, Zheng Xu 1, Hyungjun Kim 1, Paul V. Gratz 1, Jiang Hu 1, Michael Kishinevsky 2 and Umit Ogras 2
1 Computer Engineering and Systems Group, Department of ECE, Texas A&M University
2 Strategic CAD Labs, Intel Corp.

Introduction: The Power/Performance Challenge
- VLSI technology trends:
  - Continued transistor scaling provides more transistors
  - Traditional VLSI gains have stopped: power is increasing and transistor performance is stagnant
- Achieving performance in modern VLSI:
  - Multi-core/CMP for performance
  - NoCs for communication
  - CMP power management to permit further performance gains, bringing new challenges

Core Power Management
[Figure: core tile with the core (µP), L1i, L1d, and L2]
- Typically, power management covers only the core and lower-level caches
- A simpler problem (relatively speaking):
  - All performance information is available locally: instructions per cycle, lower-level cache miss rates, idle time
  - Each core can act independently
  - Performance scales approximately linearly with frequency
- Cores are only part of the problem: power management in the uncore is a different domain

Typical Chip-Multiprocessors
Chip-multiprocessors (CMPs): complexity moves from the cores up the memory system hierarchy.
[Figure: CMP tile with the core (µP), L1i, L1d, L2, an L3 cache slice, directory (Dir), and router (R)]
- Multi-level hierarchies: private lower levels, shared last level
- Networks-on-chip for cache block transfers and cache coherence

CMP Power Management Challenge
[Same CMP tile diagram as the previous slide]
- A large fraction of the power is outside the cores:
  - The LLC is shared among many cores (and distributed!)
  - The network-on-chip interconnects the cores
  - 12 W on the Single-Chip Cloud Computer!
- Indirect impact on system performance: depends on lower-level cache miss rates

CMP DVFS Partitioning
Candidate domain partitionings:
- Domains per tile
- Domains per core, with a separate domain for the uncore

Project Goals
Develop a power management policy for a CMP uncore: maximum savings with minimal impact on performance (< 5% IPC loss).
- What to monitor?
- How to propagate information to the central controller?
- What policy to implement?

Outline
- Introduction
- Design Description
  - Uncore Power Management
  - Metrics
  - Information Propagation
  - PID Control
- Evaluation
- Conclusions and Future Work

Uncore Power Management
Effective uncore power management is a classic control problem:
- Inputs: current performance demand and current power state (DVFS level)
- Output: next power state
Constraints:
- High-speed decisions
- Low hardware overhead
- Low impact on the system from management overheads

Design Outline
Three major components to uncore power management:
- Uncore performance metric: average memory access time (AMAT)
- Status propagation: in-network, in the unused portion of packet headers
- Control policy: PID control over a fixed time window

Performance Metrics
The uncore: LLC + NoC. Which performance metric?
- NoC-centric? Credits, free VCs, per-hop latency
- LLC-centric? LLC access rate, LLC miss rate
Ultimately, who cares about uncore performance in isolation? We need a metric that quantifies the memory system's effect on system performance: average memory access time (AMAT).

Average Memory Access Time

AMAT = HitRate_L1 * AccTime_L1 + (1 - HitRate_L1) * (HitRate_L2 * AccTime_L2 + (1 - HitRate_L2) * Latency_Uncore)

- A direct measurement of memory system performance
- An AMAT increase of X yields an IPC loss of roughly X/2, for small X
- Experimentally determined AMAT vs. uncore clock rate for two cases: f0, no private-cache hits; f1, all private-cache hits
Note: HitRate_L1, HitRate_L2, and Latency_Uncore require information from each core to compute weighted averages! A sketch of this computation follows.
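Below is a minimal Python sketch of that system-wide AMAT computation. The per-core counter fields and the access-time constants are illustrative assumptions, not values from the paper.

    # Minimal sketch: weighted system-wide AMAT from per-core counters.
    # Field names and latency constants are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class CoreCounters:
        l1_accesses: int
        l1_hits: int
        l2_accesses: int         # L1 misses that reach the private L2
        l2_hits: int
        uncore_accesses: int     # L2 misses that enter the NoC/LLC
        uncore_latency_sum: int  # total cycles those misses spent in the uncore

    ACC_TIME_L1 = 2    # cycles (assumed)
    ACC_TIME_L2 = 10   # cycles (assumed)

    def system_amat(cores):
        """AMAT with hit rates and uncore latency weighted by access counts."""
        hit_l1 = sum(c.l1_hits for c in cores) / sum(c.l1_accesses for c in cores)
        hit_l2 = sum(c.l2_hits for c in cores) / sum(c.l2_accesses for c in cores)
        lat_uncore = (sum(c.uncore_latency_sum for c in cores) /
                      sum(c.uncore_accesses for c in cores))
        return (hit_l1 * ACC_TIME_L1 +
                (1 - hit_l1) * (hit_l2 * ACC_TIME_L2 +
                                (1 - hit_l2) * lat_uncore))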

Information Propagation
- In-network status packets would be too costly: bursts of status traffic would impact performance and increase dynamic energy
- A dedicated status network would be overkill: the data rate is fairly low (~8 bytes per core per 50,000-cycle time window), yet it would drain power constantly
- Instead, piggyback status info in packet headers:
  - The link width is often an even divisor of the cache line size, leaving unused space in the header
  - No congestion or power impact
  - But how timely is the status info?

Information Propagation
- A single power controller node (node 6 in the figure)
- Status is sent opportunistically: info is harvested as packets pass through the controller node
- However, per-core info is not necessarily received by the end of every window
[Figure: uncore NoC; the grey tile contains the performance monitor, and dashed arrows represent packet paths]
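Below is a minimal sketch of how a per-core status record might be packed into the unused header space. The 8-byte record and its field layout are assumptions for illustration; the slides only state that spare header space carries the status.

    import struct

    # Hypothetical 8-byte status record carried in spare header space:
    # four 16-bit saturating counters for the current window (layout assumed).
    STATUS_FMT = "<HHHH"  # l1_hits, l1_accesses, l2_hits, l2_accesses

    def pack_status(l1_hits, l1_accesses, l2_hits, l2_accesses):
        saturate = lambda v: min(v, 0xFFFF)  # clamp to the 16-bit field width
        return struct.pack(STATUS_FMT, saturate(l1_hits), saturate(l1_accesses),
                           saturate(l2_hits), saturate(l2_accesses))

    def harvest_status(spare_header_bytes):
        # Controller side: unpack the record as the packet passes through.
        return struct.unpack(STATUS_FMT, spare_header_bytes)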

Extrapolation
- The AMAT calculation requires information from all nodes at the end of each time window
- Opportunistic piggybacking provides no guarantees on information timeliness
- Naïvely using the last packet received biases the weighted AMAT average
- Instead, extrapolate packet counts to the end of the time window (see the sketch below):
  - More accurate weights for the AMAT calculation
  - Nodes from which no data is received are excluded from AMAT
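A minimal sketch of such an extrapolation, assuming each harvested record is tagged with the cycle at which it was sampled (the timestamp is an assumption for illustration):

    def extrapolate_count(count, sample_cycle, window_start, window_length):
        # Linearly scale a mid-window counter sample out to the full window,
        # assuming the node's access rate stays constant for the remainder.
        elapsed = sample_cycle - window_start
        if elapsed <= 0:
            return None  # no usable sample: exclude this node from AMAT
        return count * window_length / elapsed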

Power Management Controller
PID (proportional-integral-derivative) control:
- Computationally simpler than machine learning techniques
- Adapts more readily and quickly to many different workloads than rule-based approaches
- Theoretical grounds for stability (proof in the paper)
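A minimal sketch of a discrete PID loop driving the uncore DVFS level once per time window; the gains, the AMAT reference, and the mapping of the control output onto discrete levels are illustrative assumptions, not the paper's tuning:

    class UncorePID:
        def __init__(self, kp, ki, kd, ref_amat):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.ref = ref_amat     # target AMAT in cycles (assumed)
            self.integral = 0.0
            self.prev_err = 0.0

        def next_level(self, measured_amat, level, num_levels):
            # Positive error: AMAT is worse than the target, so speed up.
            err = measured_amat - self.ref
            self.integral += err
            derivative = err - self.prev_err
            self.prev_err = err
            u = self.kp * err + self.ki * self.integral + self.kd * derivative
            # Quantize the control output onto the discrete DVFS levels.
            return max(0, min(num_levels - 1, level + round(u)))

Once per 50,000-cycle window, such a controller would take the extrapolated AMAT estimate and return the uncore's next voltage/frequency level.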

Outline
- Introduction
- Design Description
- Evaluation
  - Methodology
  - Power and Performance
  - Estimated AMAT + PID vs. perfect AMAT + PID vs. rule-based
  - Analysis: tracking ideal DVFS ratio selection
- Conclusions and Future Work

Methodology
- Memory system traces: PARSEC applications, traces generated with M5, first 250M memory operations
- Custom simulator: L1 + L2 + NoC + LLC + directory
- Energy savings calculated from dynamic power (static power would benefit as well; future work)

Power and Performance
[Charts: normalized dynamic energy consumption and normalized performance loss]
- Average of 33% energy savings versus the baseline
- Average of ~5% AMAT loss (< 2.5% IPC loss)

Comparison vs. Perfect AMAT
[Charts: normalized dynamic energy consumption and normalized performance loss]
- Virtually identical power savings vs. perfect AMAT
- Slight loss in performance vs. perfect AMAT

Comparison vs. Rule-Based
[Charts: normalized dynamic energy consumption and normalized performance loss]
- Virtually identical power savings vs. rule-based
- 50% less performance loss

Analysis: PID Tracking vs. Ideal
- PID is generally slightly conservative
- It reacts quickly and accurately to spikes in demand

Conclusions and Future Work
We introduce a power management system for the CMP uncore:
- Performance metric: estimated AMAT
- Information propagation: in-network, piggybacked
- Control algorithm: PID
Results: 33% energy savings with insignificant performance loss, near-ideal AMAT estimation, and better performance than rule-based techniques.

Conclusions and Future Work
We have just scratched the surface here:
- Dynamic cache footprint analysis for LLC power gating
- Are cycles of uncore utilization predictable? Neural-network approaches to control and other predictive techniques
- Not all misses are equally important: load criticality analysis to improve control

Backup

DVFS Background
- Reducing frequency allows a corresponding voltage reduction:
  - Dynamic power (≈ αCV²f) drops steeply, roughly cubically when voltage scales with frequency
  - Static power is reduced as well
- Obvious performance impacts
- The power management algorithm must choose the best power-performance tradeoff
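A minimal worked example of the standard switching-power model, with illustrative constants, showing why a 20% frequency reduction that permits a proportional voltage reduction cuts dynamic power roughly in half:

    def dynamic_power(alpha, c_eff, voltage, freq):
        # Classic switching-power model: P = alpha * C * V^2 * f.
        return alpha * c_eff * voltage * voltage * freq

    # Illustrative numbers only.
    alpha, c_eff = 0.2, 1e-9    # activity factor, effective capacitance (F)
    v_nom, f_nom = 1.0, 2e9     # 1.0 V at 2 GHz

    p_nom = dynamic_power(alpha, c_eff, v_nom, f_nom)
    p_scaled = dynamic_power(alpha, c_eff, 0.8 * v_nom, 0.8 * f_nom)
    print(p_scaled / p_nom)     # 0.512: about 49% dynamic power savings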