Adaptive Energy Management Features of the POWER7™ Processor
Michael Floyd, POWER7 EnergyScale Architect


Hot Chips 22, August 23, 2010
Adaptive Energy Management Features of the POWER7™ Processor
Michael Floyd, POWER7 EnergyScale Architect
Bishop Brock, Malcolm Ware, Karthick Rajamani, Alan Drake, Charles Lefurgy & Lorena Pesantez
Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002

Outline
- EnergyScale™ Overview
- POWER7 Energy Management Features
- New POWER7 Autonomic Mechanisms

POWER7 EnergyScale™ Goals
- System-level performance-aware energy management
  - Implement customer-selected energy management policy
  - Directly measure: performance, utilization, power consumption, temperature
- Take advantage of workload and environment
  - Save energy when not fully utilized
  - Optimize frequency and voltage to match needs of workload and limits of environment
- Apply benefits to either:
  - Increased performance, OR
  - Reduced power at the same performance level

EnergyScale Primary Policies In Action
[Figure: sampled processor frequency (MHz) and sampled power (% of max rated system power), 1 millisecond samples, under the Nominal, Max Performance, and Dynamic Power Save policies. Percentage tick labels shown: 25%, 38%, 50%, 63%, 75%.]
- Shipping EnergyScale policies with representative usage (real customer code)
- IBM Power 750 Express Server (not fully populated)
- Highly utilized scientific application with varying workload profile
* Energy savings vary by workload, system configuration, environment, etc.

EnergyScale = Cooperative Hardware & Firmware Solution
[Diagram, physical implementation: within a P7 system, IBM Systems Director (policy management; power & performance measurement) exchanges policy and data with a dedicated EnergyScale microcontroller, which has sense and control connections to the memory, power supply, POWER7 processor, and I/O devices.]
[Diagram, logical control loop: Sense (environmental conditions, customer workload), Decide (customer policy), Actuate. Remaining labels: Firmware, Control, Hardware.]

POWER7 Features: Sense
- Dedicated microarchitectural activity & event counters
  - Processor core, memory hierarchy, and main memory access
  - Provide performance, utilization, and activity measurements
  - Used to direct power/performance tradeoff decisions & techniques
- Digital Thermal Sensor (DTS)
  - 44 on-chip sense points
  - 5 per core chiplet
  - Emergency self-protect thermal throttling
- Critical Path Monitor (CPM)
  - Detects circuit timing margin
  - Assists in choosing optimal frequency & voltage
[Figure: Physical Locations of Thermal Sensors. Floorplan labels: Core 0 through Core 7, MemCtrl 0, MemCtrl 1, SMP Coherence Fabric, Off-chip Interconnect, Off-Module Interconnect.]

POWER7 Features: Decide
- Dedicated off-chip EnergyScale microcontroller
  - Runs real-time firmware whose sole purpose is to manage system energy
  - POWER7 chip provides dedicated I2C slave communication port
- POWER7 accelerators for off-chip microcontroller decisions
  - Reducing communication bandwidth need = faster control loop response time
    - Sensor packing: to reduce number of read operations
    - Multicast function: to reduce number of write operations
    - Automated on-chip transaction table: to stream out sensor data via single I2C command
  - Offload & automate compute-intensive chores from EnergyScale microcontroller
    - Thermal sensor conversion to degrees C using quadratic curve fit
    - Chiplet power proxy calculation
    - Automated voltage change sequencer (hardware state machine slew assist)
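The thermal sensor conversion listed above is described only as a quadratic curve fit. A minimal sketch of what such a conversion could look like follows; the function name, the raw-count input, and the coefficient values are all hypothetical, since the slides give no calibration data.

```python
def dts_to_celsius(raw: float, a: float, b: float, c: float) -> float:
    """Evaluate a quadratic fit T = a*raw^2 + b*raw + c.

    The coefficients a, b, c are assumed to come from per-sensor
    calibration; the slides do not state their values.
    """
    # Horner form: one multiply fewer than the expanded polynomial.
    return (a * raw + b) * raw + c


# Illustrative (made-up) coefficients, not real POWER7 calibration data.
temperature_c = dts_to_celsius(raw=4, a=0.5, b=2.0, c=-10.0)
print(temperature_c)
```

Doing this conversion on-chip means the microcontroller reads temperatures directly instead of raw counts, which fits the slide's stated aim of offloading compute-intensive chores.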

POWER7 Features: Control
- Per-core frequency control
  - Digital PLL (DPLL) clock source supports full EnergyScale dynamic range of -50% to +10% of nominal frequency with 25 MHz resolution
  - Automated fast frequency slew in excess of 50 MHz per µs
- On-chip support for off-chip voltage control
  - Industry-standard parallel VID interface for VRM control on low-end to midrange systems
  - Serial voltage command interface to automate multi-step I2C transactions to power supply
    - Necessary to support high-end systems' RAS and power delivery requirements
- Memory (DIMM) power management
  - Power-down and reduced access rate modes
  - Channel-pair level memory activity control
- Changing coherence interconnect command rate
  - Incorporating asynchronous core support into the SMP interconnect design was not trivial
  - Coherence command rate limits amount of frequency reduction possible
  - Ability to change coherence command rate on the fly to tune processor versus SMP bandwidth
  - Firmware can learn ideal command rate for workload by setting utilization thresholds
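To make the DPLL numbers above concrete, here is a small sketch that clamps a requested frequency to the -50% to +10% window, snaps it to a 25 MHz grid, and bounds the slew time at 50 MHz per µs. The function names, the 3000 MHz nominal frequency, and the assumption that the grid sits on absolute multiples of 25 MHz are illustrative, not taken from the slides.

```python
STEP_MHZ = 25          # DPLL resolution from the slide
SLEW_MHZ_PER_US = 50   # slide states slew "in excess of" this rate


def quantize_dpll_target(target_mhz: int, nominal_mhz: int) -> int:
    """Clamp to [-50%, +10%] of nominal and round down to the 25 MHz grid."""
    # Round the lower bound up and the upper bound down so both lie on the grid.
    lo = -(-(nominal_mhz * 50 // 100) // STEP_MHZ) * STEP_MHZ
    hi = (nominal_mhz * 110 // 100) // STEP_MHZ * STEP_MHZ
    snapped = target_mhz // STEP_MHZ * STEP_MHZ
    return max(lo, min(hi, snapped))


def max_slew_time_us(from_mhz: int, to_mhz: int) -> float:
    """Upper bound on slew time, since the real slew rate exceeds 50 MHz/us."""
    return abs(to_mhz - from_mhz) / SLEW_MHZ_PER_US


nominal = 3000  # hypothetical nominal frequency in MHz
print(quantize_dpll_target(3210, nominal))   # 3200
print(quantize_dpll_target(4000, nominal))   # 3300 (clamped to +10%)
print(max_slew_time_us(1500, 3300))          # 36.0
```

Under these assumptions, a swing across the whole range completes in tens of microseconds at most.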

Actuate: Save Energy When Idle
Three idle states were implemented to optimize power vs. latency:
- Nap
  - Optimized for wake-up time
  - Turn off clocks to execution units
  - Caches remain coherent
- Sleep
  - More savings at higher latency
  - Purge and clock off core plus caches
- Heavy Sleep
  - All cores in sleep mode
  - Reduce voltage of all cores to retention
  - Voltage ramps automatically on wake-up
  - No hardware re-initialization required
[Figure: Processor Idle States. Typical wake-up latency versus processor energy reduction (compared to idle loop) for Nap, Sleep, Heavy Sleep, and Winkle (future chips). Labels in the figure: Nap 15%, Sleep 35%, Heavy Sleep 85%, with 100% also marked; latency labels 5 µs, 1 ms, 2 ms, 10-20 ms; annotations "clocks" and "voltage".]

Actuate: Per-Core Frequency Scaling
- Per-core run-frequency scaling: up to 30% benefit
  - Allows tuning within a partition for non-homogeneous workloads (different on each processor core)
- Per-core nap-frequency scaling: up to 20% benefit
- Supports energy optimization in partitioned system configurations
  - Less-utilized partitions can run at lower frequencies
  - Heavily utilized partitions maintain peak performance
  - Each partition can run under a different energy-savings policy
[Figure: Benefits of Per-Core Frequency Scaling. Relative core power (8 cores, axis 0 to 1.2) versus increasing core voltage, for four cases: 8 cores run Fmax @ V; 1 core runs Fmax, 7 cores nap Fmax; 1 core runs Fmax, 7 cores run Fmin; 1 core runs Fmax, 7 cores nap Fmin.]
Note: the highest-frequency core determines the required voltage.
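The note that the highest-frequency core determines the required voltage can be illustrated with a rough model. The sketch below assumes the textbook CMOS relation that dynamic power scales with f·V², ignores leakage, and uses a made-up linear voltage/frequency curve; none of these come from the slides, and the output is not meant to reproduce the figure.

```python
from typing import Callable, Sequence


def required_voltage(freqs: Sequence[float], vf_curve: Callable[[float], float]) -> float:
    """All cores are assumed to share one rail, so the fastest core sets the voltage."""
    return vf_curve(max(freqs))


def relative_dynamic_power(freqs: Sequence[float],
                           vf_curve: Callable[[float], float],
                           f_max: float) -> float:
    """Dynamic power relative to every core running at f_max (P ~ f * V^2, leakage ignored)."""
    v = required_voltage(freqs, vf_curve)
    v_ref = vf_curve(f_max)
    return sum(f * v ** 2 for f in freqs) / (len(freqs) * f_max * v_ref ** 2)


def example_vf_curve(f: float) -> float:
    """Hypothetical normalized V/f curve: 0.85 at half speed, 1.0 at full speed."""
    return 0.7 + 0.3 * f


one_fast = [1.0] + [0.5] * 7   # 1 core at Fmax, 7 cores at Fmin = 0.5 * Fmax
all_slow = [0.5] * 8           # every core at Fmin, so voltage can drop too

print(relative_dynamic_power(one_fast, example_vf_curve, f_max=1.0))
print(relative_dynamic_power(all_slow, example_vf_curve, f_max=1.0))
```

In this model the mixed case saves power only through lower frequency, because the one fast core keeps the voltage at its maximum; dropping every core also lowers the voltage and saves more.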

Result: EnergyScale Impact with POWER7
- SPECpower_ssj2008 runs on an IBM Power 750 Express system** (source: Heather Hanson, IBM Research)
- Nominal, adding dynamic fan speed control = 14% higher score
- Two EnergyScale energy-saving modes:
  - SPS (Static Power Save): fixed, low-power operating point = improved score almost 25%
  - DPS (Dynamic Power Savings): DVFS with Turbo mode = improved score almost 50%
* Results shown on our prototype system, should not be construed as committed capability for a shipping IBM Server.
* SPEC and the benchmark name SPECpower_ssj are trademarks of the Standard Performance Evaluation Corporation

New Autonomic EnergyScale Features
Exploratory mechanisms available in POWER7 hardware:
- Low Activity Detect
- Power Proxy
- Autonomic circuit timing margin feedback control

Autonomic Frequency Reduction During Low Activity
- Memory-bound workloads do not need full processor compute frequency while waiting on data.
- Systems 100% utilized by traditional metrics may actually be 100% idle in terms of servicing real applications, e.g. while polling for work to arrive.
- Solution = Low Activity Detect (LAD)
  1. Hardware reduces processor frequency in response to drop in instruction throughput
  2. EnergyScale algorithms actuate in response to autonomous frequency reduction
- Minimize service latency impact on work arrival: hardware is still running instructions and can rapidly restore full frequency using DPLL slewing.
- Green Polling: software artificially drops instruction throughput to engage the autonomic hardware mechanism. Useful for:
  - Traditional idle loops: prefer to avoid overhead of idle state context switch and retain fast reaction time to work arriving in the queue
  - Message-passing work queuing: idle state not possible. Goal is to react to a change in memory location as fast as possible.
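The slide does not say how the hardware decides that instruction throughput has dropped. As one possible reading, the sketch below models Low Activity Detect as a threshold on sampled instructions per cycle (IPC) with a short hold-off before the frequency is lowered; the class name, threshold, hold-off length, and frequencies are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LowActivityDetector:
    """Toy model of LAD: drop frequency after sustained low throughput."""
    f_full_mhz: int
    f_low_mhz: int
    ipc_threshold: float   # assumed trigger level, not from the slides
    hold_samples: int      # consecutive low samples required before dropping
    _low_count: int = 0

    def step(self, ipc: float) -> int:
        """Consume one throughput sample and return the frequency to run at."""
        if ipc < self.ipc_threshold:
            self._low_count += 1
        else:
            # Any busy sample restores full frequency on the next step.
            self._low_count = 0
        if self._low_count >= self.hold_samples:
            return self.f_low_mhz
        return self.f_full_mhz


lad = LowActivityDetector(f_full_mhz=3000, f_low_mhz=1500,
                          ipc_threshold=0.2, hold_samples=3)
trace = [1.0, 0.1, 0.1, 0.1, 0.1, 1.0]
print([lad.step(ipc) for ipc in trace])
# [3000, 3000, 3000, 1500, 1500, 3000]
```

In this model, green polling corresponds to software deliberately keeping its IPC below the threshold while it waits, so the frequency drops without entering an idle state.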

Sense: Processor Core Power Proxy
- Goal: estimate per-core chiplet power that we cannot directly measure
- Method:
  - For each functional unit, pick a small subset of activities to infer power consumption (e.g. cache & regfile reads & writes, execution pipeline issue)
  - Weight each activity to represent how much relative power it consumes
  - Combine weighted Core, L2, and L3 activity, then add constant offset plus clock grid power to form:
    Chiplet Active Power = Σ (W_i × A_i) + C + K × f
- Result: EnergyScale firmware adjusts this value for effects of leakage, temperature, and voltage
[Figure: processor core chiplet with activity sense points feeding the power proxy. Labels: Core Activity (IFU, ISU, FXU, LSU, DFU, VSU & FPU), L2, L3, NCU, "5 events", "4 events".]
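The power proxy formula above is a weighted sum of activity counts plus a constant offset and a frequency-proportional clock grid term. A minimal sketch of that calculation follows; the function name, argument names, and sample numbers are illustrative only, and the firmware adjustment for leakage, temperature, and voltage is not modeled.

```python
from typing import Sequence


def chiplet_active_power(weights: Sequence[float],
                         activity: Sequence[float],
                         offset_c: float,
                         clock_k: float,
                         freq: float) -> float:
    """Return sum(W_i * A_i) + C + K * f for one core chiplet."""
    if len(weights) != len(activity):
        raise ValueError("each activity count needs exactly one weight")
    weighted = sum(w * a for w, a in zip(weights, activity))
    return weighted + offset_c + clock_k * freq


# Made-up weights and activity counts for three sense points.
estimate = chiplet_active_power(weights=[2.0, 0.5, 1.0],
                                activity=[10, 4, 3],
                                offset_c=5.0,
                                clock_k=0.5,
                                freq=4.0)
print(estimate)  # 32.0
```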

Result: Power Proxy Measurements
EnergyScale firmware budgets power across multiple processors and memory, used to:
- Shift power to cores or other components (e.g. memory) that need it the most (especially important to achieve higher overall performance under a power cap)
- Enable server partition power accounting
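The slides do not describe the budgeting algorithm itself. As one simple illustration of shifting power toward the components that need it under a cap, the sketch below scales per-component demand (for example, power proxy estimates) proportionally when the total exceeds the cap. The function name, component names, and wattages are hypothetical.

```python
from typing import Dict


def budget_power(demands_w: Dict[str, float], cap_w: float) -> Dict[str, float]:
    """Give each component its demand, scaled down proportionally if over the cap."""
    total = sum(demands_w.values())
    if total <= cap_w:
        # Everything fits under the cap, so no component is throttled.
        return dict(demands_w)
    scale = cap_w / total
    return {name: demand * scale for name, demand in demands_w.items()}


# A busy core and lightly used memory sharing a 20 W cap (made-up numbers).
print(budget_power({"core0": 30.0, "memory": 10.0}, cap_w=20.0))
# {'core0': 15.0, 'memory': 5.0}
```

Compared with an equal split of 10 W each, the busier component receives the larger share, which is the effect the slide describes.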

Reducing wasteful guardband
- Conventional guardband
  - Static, conservative voltage margins for potential worst-case conditions
  - Causes unnecessary loss of energy efficiency during typical server usage
- Critical Path Monitor (CPM)
  - Dynamic detection of available circuit timing margin
  - Location of edge indicates margin
[Figure: CPM structure. Labels: Input Signal, Insertion Delay (calibration), Representative Critical Path (tunable), Edge Detector; outcomes Remove, Reduce, No Change; Reclaimed Margin.]
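The slide says only that the location of the edge in the detector indicates the timing margin. A hedged sketch of how a feedback step might use that reading follows: keep the edge position inside a target band by nudging a voltage setting up or down. The function name, the band limits, and the idea of stepping a voltage ID by one are assumptions for illustration.

```python
def guardband_step(edge_pos: int, target_lo: int, target_hi: int,
                   vid: int, vid_min: int, vid_max: int) -> int:
    """Return the next voltage ID given a CPM edge position.

    A larger edge_pos is taken to mean more timing margin (an assumption).
    """
    if edge_pos > target_hi:
        # More margin than needed: lower the voltage to reclaim it.
        return max(vid_min, vid - 1)
    if edge_pos < target_lo:
        # Margin too thin: raise the voltage to restore it.
        return min(vid_max, vid + 1)
    # Inside the target band: no change.
    return vid


vid = 40
for edge in [9, 9, 6, 2]:  # made-up CPM readings
    vid = guardband_step(edge, target_lo=4, target_hi=7,
                         vid=vid, vid_min=30, vid_max=50)
    print(vid)
# 39, 38, 38, 39
```

The same loop could adjust frequency instead of voltage, matching the two modes (over-clocking and under-volting) reported in the results.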

Results: Using CPMs for dynamic guardbanding
- Static guardband: traditional guardband selection
- Dynamic guardband: use CPM feedback to optimize frequency or voltage
- Workload: SPECpower_ssj 100% load level (EnergyScale DPS-FP policy)
- Running on IBM Power 750 Express Server (32 cores, 64 GB @ 22 °C ambient)
[Figure, over-clocking (improve performance): normalized transaction rate (axis 50% to 120%) for static guardband, dynamic guardband, and physical limit (where system fails); value labels +7.3% and +8.4%. All cases use same turbo voltage level; only frequency is adjusted.]
[Figure, under-volting (save energy): system AC power (axis 50% to 120%) for static guardband, dynamic guardband, and physical limit (where system fails); value labels -15.8% and -26.4%. All cases same frequency & performance; only voltage is adjusted.]

Summary
- POWER7 builds upon initial POWER6 EnergyScale features by including automated on-chip functions and accelerators to assist the off-chip microcontroller firmware
- New energy management features in POWER7 have shown a 50% improvement in SPECpower score over baseline operation
- Customers can select the best EnergyScale policy to match their needs, relying on the system to balance power consumption and performance accordingly
- Autonomous low activity, power proxy, and circuit margin feedback functions provide even more opportunity for energy efficient operation