Power-Aware High-Performance Scientific Computing
Padma Raghavan
Scalable Computing Laboratory, Department of Computer Science and Engineering
The Pennsylvania State University
http://www.cse.psu.edu/~raghavan
Supported by NSF ST-HEC: PxP: Co-Managing Performance x Power
Trends: Microprocessor Design & HPC
Microprocessor design
Gordon Moore, 1965: transistor counts double roughly every 18 months
Focus on peak rates; LINPACK benchmarks with dense codes
Patrick Gelsinger, 2004 DAC keynote: power is the only real limiter
HPC and science through simulation
High costs of installation and cooling
A petascale system is infeasible without new low-power designs (Simon, Boku)
Gap between peak (TOP500) and sustained rates on real workloads
Petascale instrument vs. desktop supercomputing
CMPs/multicores and performance, power, and productivity issues
Why Sparse Scientific Codes?
Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates despite tuning
Sparse codes represent scalable formulations for many applications, but:
Limited data locality and data re-use
Memory- and network-latency bound
Load imbalances despite partitioning/re-partitioning
Multiple algorithms and implementations with different quality/performance trade-offs
They present many opportunities for adaptive Q(uality) x P(erformance) x P(ower) tuning
Sparse Codes and Data
Example: sparse y = Ax
Used in many PDE simulations: in explicit codes, in implicit codes with linear system solution, and in data clustering with K-means
Reverse Cuthill-McKee (RCM) ordering to get locality of access in x
Data locality and data reuse for elements of x
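A minimal sketch of sparse y = Ax in compressed sparse row (CSR) form (the slides do not name a storage scheme, so CSR is an assumption). After RCM reordering, the column indices within each row cluster together, which is what gives the locality of access to x:

    /* Sparse matrix-vector product y = A*x, CSR storage.
     * row_ptr[i]..row_ptr[i+1] bounds the nonzeros of row i;
     * col_idx[k] and val[k] give their columns and values. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                sum += val[k] * x[col_idx[k]];  /* irregular reads of x: reuse depends on ordering */
            y[i] = sum;
        }
    }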
This Presentation
Microprocessor/network architectural optimizations x application features
PxP results for sparse scientific computing
Optimizing CPU + memory for sparse PxP
PxP models for adaptive feature selection
PxP trends on MPPs with CPU + link scaling
Summary and conclusions
PxP Results - I
Characterizing power reductions and performance improvements for a single node, i.e., CPU + memory
There is locality of data access in many sparse codes when matrices are reordered and appropriate data structures are used
Konrad Malkowski (lead)
Power-Aware + High-Performance Computing
Power of CMOS chips: P = C * Vdd^2 * f + Vdd * I_leak
Typically, higher performance = higher f with higher transistor counts, running up against thermal limits
Tuning power:
DVS: dynamic voltage and frequency scaling for CPUs
Drowsy/low-power modes of caches and DRAM memory banks
ABB: adaptive body biasing, which reduces I_leak
If these low-power knobs are exposed in the ISA, they can be used to control power in applications
If some of the power savings are redirected to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy
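To make the DVS arithmetic concrete, a small sketch with illustrative (not measured) constants: if Vdd scales roughly in proportion to f, dynamic power P = C * Vdd^2 * f falls roughly with the cube of frequency while CPU-bound runtime grows only linearly, so energy can drop even when time increases:

    #include <stdio.h>

    /* Illustrative DVS model: supply voltage assumed proportional to frequency. */
    int main(void)
    {
        double C = 1.0;                       /* effective switched capacitance (arbitrary units) */
        double f_base = 1.0, v_base = 1.0;    /* normalized 1 GHz, nominal Vdd */
        for (double s = 1.0; s >= 0.6; s -= 0.2) {
            double f = f_base * s, v = v_base * s;
            double p_dyn = C * v * v * f;     /* P = C * Vdd^2 * f */
            double time  = 1.0 / s;           /* CPU-bound runtime grows as 1/f */
            printf("f=%.1f: power=%.2f time=%.2f energy=%.2f\n",
                   f, p_dyn, time, p_dyn * time);
        }
        return 0;
    }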
Methodology
Cycle-accurate architectural emulations using SimpleScalar, Wattch, and CACTI
Emulate CPU with caches + off-chip DRAM memory, starting with a PowerPC-like core (like a BG/L processor)
Emulate low-power modes
Model DVS by scaling frequency and supply voltage
Model low-power modes of caches by emulating smaller caches
Emulate memory subsystem optimizations
Extend SimpleScalar/Wattch to add structures for optimizations that reduce memory latency
Base (B) Architecture
PowerPC-like, 1 GHz core
4 MB SRAM L3 (26-cycle latency)
2 KB SRAM L2 (7-cycle latency)
32 KB SRAM L1 instruction and data caches (1-cycle latency)
Memory bus: 64 bits
Memory size: 256 MB (9 x 256 Mbit x 8-pin DRAM)
Architectural Extensions
Wider memory bus: 128 bits vs. the original 64 (W)
Memory page policy: open or closed (MO)
Prefetcher (stride 1) in the memory controller (MP)
Prefetcher (stride 1) in the L2 cache (LP)
Load Miss Predictor in the L1 cache (LMP)
Prefetchers can reduce latency if there is locality of access
If the sparse matrix is highly irregular (inherently or from the implementation), an LMP can avoid the latency of the cache hierarchy
We developed an LMP similar to a branch-prediction structure
Memory Prefetcher (MP)
Added a prefetch buffer to the memory controller
16-entry table with 128-byte cache lines, LRU replacement
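A sketch of the buffer's lookup and replacement logic under the stated parameters (16 entries, 128-byte lines, LRU); the slides do not show the controller logic, so the details here are assumptions:

    #include <stdint.h>

    #define PB_ENTRIES 16
    #define LINE_BYTES 128

    typedef struct {
        uint64_t tag[PB_ENTRIES];        /* line addresses held in the buffer */
        int      valid[PB_ENTRIES];
        uint64_t last_use[PB_ENTRIES];   /* timestamps for LRU replacement */
        uint64_t clock;
    } prefetch_buf;                      /* zero-initialize before use */

    /* On each demand access: report a hit if the line is buffered, else
     * prefetch the next sequential (stride-1) line into the LRU slot. */
    int pb_access(prefetch_buf *pb, uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;
        pb->clock++;
        for (int i = 0; i < PB_ENTRIES; i++) {
            if (pb->valid[i] && pb->tag[i] == line) {
                pb->last_use[i] = pb->clock;
                return 1;   /* hit: DRAM latency hidden */
            }
        }
        int victim = 0;   /* empty slots have last_use == 0, so they fill first */
        for (int i = 1; i < PB_ENTRIES; i++)
            if (pb->last_use[i] < pb->last_use[victim]) victim = i;
        pb->tag[victim]      = line + 1;   /* stride-1: next sequential line */
        pb->valid[victim]    = 1;
        pb->last_use[victim] = pb->clock;
        return 0;   /* miss: full DRAM latency */
    }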
L2 Cache Prefetcher (LP)
Benefits codes with locality of data access but poor data re-use
Memory Page Policy: Open / Closed (MO)
Accesses to open rows have lower latency
Memory control is more complex
Access latencies are not as predictable
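Why open rows are cheaper, as a one-function sketch; the cycle counts are illustrative, not the values used in the simulations:

    /* Row-buffer model behind the open-page policy: a hit to the currently
     * open row skips the precharge/activate steps. */
    int dram_access_latency(int *open_row, int row)
    {
        if (*open_row == row)
            return 20;    /* row-buffer hit: column access only */
        *open_row = row;  /* precharge the old row, activate the new one */
        return 50;        /* precharge + activate + column access */
    }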
Load Miss Predictor
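The slides describe the LMP as similar to a branch-prediction structure; a minimal sketch under that reading (indexing by load PC with 2-bit saturating counters, as in a bimodal branch predictor, is my choice of structure, not specified in the slides):

    #include <stdint.h>

    #define LMP_ENTRIES 1024

    /* 2-bit saturating counters indexed by load PC: a counter >= 2
     * predicts "this load will miss", so the access can be sent to
     * memory early instead of walking the L1/L2/L3 hierarchy. */
    static uint8_t lmp_table[LMP_ENTRIES];   /* zero-initialized: predict hit */

    int lmp_predict_miss(uint64_t load_pc)
    {
        return lmp_table[(load_pc >> 2) % LMP_ENTRIES] >= 2;
    }

    void lmp_update(uint64_t load_pc, int did_miss)
    {
        uint8_t *c = &lmp_table[(load_pc >> 2) % LMP_ENTRIES];
        if (did_miss) { if (*c < 3) (*c)++; }
        else          { if (*c > 0) (*c)--; }
    }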
Experiments
Base (B), wider path (W), memory page policy (MO), memory prefetcher (MP), L2 prefetcher (LP), Load Miss Prediction (LMP)
Base (B) at 1000 MHz
Sparse codes:
SMV-U: no blocking, RCM ordering, 4 matrices
SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
NAS MG benchmark
Full-scale application: driven cavity flow
Metrics: time, power, energy, Ops/J (shown relative to the code at B, 1000 MHz, 4 MB L3 cache)
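SMV-O uses Sparsity-style 2x2 register blocking; a minimal sketch of a 2x2 block CSR (BCSR) kernel, assuming an even dimension and blocks stored row-major with 4 values each (the exact Sparsity-generated code is not shown in the slides):

    /* y = A*x with 2x2 register blocking (BCSR): brow_ptr/bcol_idx index
     * 2x2 blocks; val holds 4 values per block in row-major order. */
    void spmv_bcsr2x2(int nb_rows, const int *brow_ptr, const int *bcol_idx,
                      const double *val, const double *x, double *y)
    {
        for (int ib = 0; ib < nb_rows; ib++) {
            double y0 = 0.0, y1 = 0.0;   /* accumulators held in registers */
            for (int k = brow_ptr[ib]; k < brow_ptr[ib+1]; k++) {
                const double *b = &val[4*k];
                double x0 = x[2*bcol_idx[k]], x1 = x[2*bcol_idx[k]+1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*ib]   = y0;
            y[2*ib+1] = y1;
        }
    }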
Relative Time: All Features, 300 MHz to 1 GHz, 256 KB L3
[Chart; values < 1 are faster than at base]
Relative Time at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally (B, +W, +MO, +MP, +LP, +LMP); time for each code at B set to 1]
Over 40% performance improvement with all features
Without the optimizations, 40% performance degradation
Relative Power at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally (+W, +MO, +MP, +LP, +LMP); power for each code at B set to 1]
Over 66% power saved from DVS (600 MHz) and the smallest cache, with no performance penalty
Relative Energy at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally; energy for each code at B set to 1]
Over 80% improvement with all features
Without the optimizations, 40% savings but with a performance penalty
Ops/J at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally; Ops/J for each code at B set to 1]
Factor-of-5 improvement in energy efficiency
PxP Results - II
PxP for a real driven cavity flow application with typical complex code/algorithm features
Sayaka Akioka (lead)
Driven Cavity: Relative Time and Energy
[Charts of relative time and energy for feature sets +W, +MO, +MP, +LP, +LMP, All]
With all features, the code is faster by 20% even at 400 MHz, with 60% less power and energy
PxP Results - III
Models to select optimal sets of features subject to performance/power constraints
Detecting phases in an application
Adaptively selecting a feature set for each application phase:
Reduce power subject to a performance constraint
Reduce time subject to a power constraint
Konrad Malkowski (lead)
Optimal Feature Sets
Least-squares fit to derive models of power or time per code as a function of the feature-set combination F:
T(F) ≈ a_1 F_1 + ... + a_N F_N, where F_i is 1 if feature i is enabled and 0 otherwise
Errors of less than 5%
Define the workload, then select the optimal configuration under power constraints
Example: best-time 2-feature set, even workload, < 50% of base power
At 600 MHz: W + LP; at 800 MHz: MO + MP
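Once the linear models are fitted, selecting a configuration is a small search over the 2^N feature subsets. A sketch of that selection step; the coefficient values below are placeholders, not the fitted values from the study:

    #include <stdio.h>

    #define NFEAT 5   /* W, MO, MP, LP, LMP */

    /* Linear models: T(F) = t_base + sum_i t_coef[i]*F_i, similarly for P.
     * Coefficients would come from the least-squares fit per code. */
    static const double t_base = 1.0;
    static const double t_coef[NFEAT] = {-0.08, -0.05, -0.12, -0.10, -0.06};
    static const double p_base = 0.40;
    static const double p_coef[NFEAT] = { 0.02,  0.01,  0.03,  0.02,  0.02};

    int main(void)
    {
        double best_time = 1e30;
        unsigned best_set = 0;
        for (unsigned set = 0; set < (1u << NFEAT); set++) {  /* all 32 subsets */
            double t = t_base, p = p_base;
            for (int i = 0; i < NFEAT; i++)
                if (set & (1u << i)) { t += t_coef[i]; p += p_coef[i]; }
            if (p <= 0.50 && t < best_time) {  /* power cap: <= 50% of base */
                best_time = t;
                best_set  = set;
            }
        }
        printf("best feature set: 0x%02x, modeled time %.2f\n", best_set, best_time);
        return 0;
    }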
S/W Phases & Their H/W Detection
Different S/W phases can benefit from different H/W features
Challenges:
How do known S/W phases correspond to H/W-detectable phases?
What lightweight H/W metric can be used to detect a phase change?
NAS MG: LSQ and 10M-Cycle Window [figure]
NAS MG: LSQ and 100K-Cycle Window [figure]
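Assuming LSQ here denotes load/store queue occupancy sampled over a cycle window (consistent with the SimpleScalar setup), a sketch of lightweight phase-change detection by comparing consecutive window averages against a threshold:

    #include <math.h>

    /* Flag a phase change when the average LSQ occupancy of the current
     * window differs from the previous window's by more than a threshold.
     * Window length and threshold are tuning knobs (the slides contrast
     * 10M-cycle and 100K-cycle windows). */
    typedef struct {
        double prev_avg, sum;
        long   count, window;   /* window = samples per detection window */
        double threshold;
    } phase_detector;

    int phase_sample(phase_detector *pd, double lsq_occupancy)
    {
        pd->sum += lsq_occupancy;
        if (++pd->count < pd->window) return 0;
        double avg = pd->sum / pd->window;
        int changed = fabs(avg - pd->prev_avg) > pd->threshold;
        pd->prev_avg = avg;
        pd->sum = 0.0;
        pd->count = 0;
        return changed;   /* 1 => re-select the H/W feature set */
    }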
MG: Minimize Power under a Time Constraint

Phase       | T constraint | Freq (MHz) | L3 size | Page policy | LP | MP | LMP | T    | P
------------|--------------|------------|---------|-------------|----|----|-----|------|-----
Restriction | 1.2          | 700        | 1 MB    | MO          | -  | -  | -   | 1.20 | 0.29
Interp 1-6  | 1.2          | 700        | 1 MB    | MO          | -  | p  | -   | 1.19 | 0.37
Interp 7    | 1.2          | 400        | 4 MB    | MO          | p  | p  | -   | 1.15 | 0.29
Remainder   | 1.2          | 600        | 1 MB    | MO          | p  | -  | -   | 1.13 | 0.30
Restriction | 1.0          | 700        | 1 MB    | MO          | p  | p  | p   | 0.98 | 0.37
Interp 1-6  | 1.0          | 800        | 2 MB    | MO          | p  | -  | -   | 0.97 | 0.48
Interp 7    | 1.0          | 500        | 1 MB    | MC          | p  | -  | -   | 0.92 | 0.36
Remainder   | 1.0          | 700        | 1 MB    | MC          | p  | -  | -   | 0.97 | 0.35
Restriction | 0.8          | 800        | 1 MB    | MO          | -  | p  | p   | 0.80 | 0.49
Interp 1-6  | 0.8          | 1000       | 2 MB    | MO          | p  | -  | -   | 0.77 | 0.85
Interp 7    | 0.8          | 700        | 1 MB    | MO          | -  | p  | -   | 0.76 | 0.50
All vs. Adaptive (Using LSQ)
[Charts comparing three configurations: min power under a time constraint, min time under a power constraint, and all features on]
PxP Results: MPPs + MPI Codes
Utilizing load imbalance in tree-structured parallel sparse computations for energy savings
Applications run for days/weeks, so 10% of the ideal load per processor amounts to hours/days of slack
Mahmut Kandemir, F. Li, G. Chen
Tree-Based Parallel Sparse Computation
Tree node = dense/sparse data-parallel operations
Tree structure dictates data dependencies
A node depends only on the subtree rooted at that node
Computation in disjoint subtrees can proceed independently
Imbalance (despite the best data mapping) can be 10% of the ideal load per processor
Exploit task-parallelism at lower levels and data-parallelism at higher levels
Represents Barnes-Hut and FMM N-body tree codes, sparse solvers, ...
Example
[Figure: task tree N0-N12 mapped onto processors P0-P6/p0-p8, annotated with weights (computation/communication) and participating processors; the critical path is highlighted, and routing requirements cause link conflicts]
Integrated link/CPU voltage scaling converts imbalance into energy savings without performance penalties (recursive scheme, multiple passes)
Network topology constrains link scaling
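The core of converting imbalance to savings, as a one-step sketch (the scheme in the talk is recursive over the tree with multiple passes; this shows only the per-processor frequency choice):

    /* A processor whose subtree finishes early can run slower and still
     * meet the critical-path deadline: stretching its runtime (proportional
     * to my_work at f_max) out to the critical-path length allows
     * f = f_max * (my_work / critical_path), with no overall slowdown. */
    double scaled_frequency(double f_max, double my_work, double critical_path)
    {
        double f = f_max * (my_work / critical_path);
        return f < f_max ? f : f_max;
    }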
Energy Consumption
Average savings: CPU-VS (27%), LINK-VS (23%), CPU-LINK-VS (40%)
Other Results
Non-uniform cache architectures (NUCA) and CMPs
NUCA configurations for scientific computing
Utilizing a network-on-chip (NoC) with NUCA
Sayaka Akioka (in progress)
Modeling network PxP
TorusSim tool by Sarah Conner
For a single collective communication, link shutdown is possible for 55%-97% of the time
No performance penalty + energy savings
Summary
Substantial single-processor PxP improvements
For kernels, codes, and full applications
Time: 30%-50% faster
Power/energy: 50%-80% lower
Further savings from LSQ-based H/W adaptivity
Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising
Near-ideal conversion of slack to savings
Link shutdown possible for 60%-97% of a collective communication