Power-Aware High-Performance Scientific Computing
Padma Raghavan
Scalable Computing Laboratory, Department of Computer Science and Engineering
The Pennsylvania State University
http://www.cse.psu.edu/~raghavan
Supported by NSF ST-HEC: PxP: Co-Managing Performance x Power
Trends: Microprocessor Design & HPC
Microprocessor design
Gordon Moore, 1965: transistor counts double roughly every 18 months
Focus on peak rates; LINPACK benchmarks with dense codes
Patrick Gelsinger, 2004 DAC keynote: power is the only real limiter
HPC and science through simulation
High costs of installation and cooling
A petascale system is infeasible without new low-power designs (Simon, Boku)
Gap between peak (TOP500) and sustained rates on real workloads
Petascale instrument vs. desktop supercomputing
CMPs/multicores and performance, power, and productivity issues
Why Sparse Scientific Codes?
Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates despite tuning
Sparse codes represent scalable formulations for many applications, but:
Limited data locality and data re-use
Memory- and network-latency bound
Load imbalances despite partitioning/re-partitioning
Multiple algorithms and implementations with different quality/performance trade-offs
They present many opportunities for adaptive Q(uality) x P(erformance) x P(ower) tuning
Sparse Codes and Data
Example: sparse y = Ax
Used in many PDE simulations: in explicit codes, in implicit codes with linear system solution, and in data clustering with K-means
Reverse Cuthill-McKee (RCM) ordering to get locality of access in x
Data locality and data reuse for elements of x
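A minimal sketch of sparse y = Ax in compressed sparse row (CSR) form (the slides do not name a storage scheme, so CSR is an assumption). After RCM reordering, the column indices within each row cluster together, which is what gives the locality of access to x:

    /* Sparse matrix-vector product y = A*x, CSR storage.
     * row_ptr[i]..row_ptr[i+1] bounds the nonzeros of row i;
     * col_idx[k] and val[k] give their columns and values. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                sum += val[k] * x[col_idx[k]];  /* irregular reads of x: reuse depends on ordering */
            y[i] = sum;
        }
    }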
This Presentation
Microprocessor/network architectural optimizations x application features
PxP results for sparse scientific computing
Optimizing CPU + memory for sparse PxP
PxP models for adaptive feature selection
PxP trends on MPPs with CPU + link scaling
Summary and conclusions
PxP Results - I
Characterizing power reductions and performance improvements for a single node, i.e., CPU + memory
There is locality of data access in many sparse codes when matrices are reordered and appropriate data structures are used
Konrad Malkowski (lead)
Power-Aware + High-Performance Computing
Power of CMOS chips: P = C * Vdd^2 * f + Vdd * I_leak
Typically, higher performance = higher f with higher transistor counts, running up against thermal limits
Tuning power:
DVS: dynamic voltage and frequency scaling for CPUs
Drowsy/low-power modes of caches and DRAM memory banks
ABB: adaptive body biasing, which reduces I_leak
If these low-power knobs are exposed in the ISA, they can be used to control power in applications
If some of the power savings are redirected to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy
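To make the DVS arithmetic concrete, a small sketch with illustrative (not measured) constants: if Vdd scales roughly in proportion to f, dynamic power P = C * Vdd^2 * f falls roughly with the cube of frequency while CPU-bound runtime grows only linearly, so energy can drop even when time increases:

    #include <stdio.h>

    /* Illustrative DVS model: supply voltage assumed proportional to frequency. */
    int main(void)
    {
        double C = 1.0;                       /* effective switched capacitance (arbitrary units) */
        double f_base = 1.0, v_base = 1.0;    /* normalized 1 GHz, nominal Vdd */
        for (double s = 1.0; s >= 0.6; s -= 0.2) {
            double f = f_base * s, v = v_base * s;
            double p_dyn = C * v * v * f;     /* P = C * Vdd^2 * f */
            double time  = 1.0 / s;           /* CPU-bound runtime grows as 1/f */
            printf("f=%.1f: power=%.2f time=%.2f energy=%.2f\n",
                   f, p_dyn, time, p_dyn * time);
        }
        return 0;
    }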
Methodology
Cycle-accurate architectural emulations using SimpleScalar, Wattch, and CACTI
Emulate CPU with caches + off-chip DRAM memory, starting with a PowerPC-like core (like a BG/L processor)
Emulate low-power modes
Model DVS by scaling frequency and supply voltage
Model low-power modes of caches by emulating smaller caches
Emulate memory subsystem optimizations
Extend SimpleScalar/Wattch to add structures for optimizations that reduce memory latency
Base (B) Architecture
PowerPC-like, 1 GHz core
4 MB SRAM L3 (26-cycle latency)
2 KB SRAM L2 (7-cycle latency)
32 KB SRAM L1 instruction and data caches (1-cycle latency)
Memory bus: 64 bits
Memory size: 256 MB (9 x 256 Mbit x 8-pin DRAM)
Architectural Extensions
Wider memory bus: 128 bits vs. the original 64 (W)
Memory page policy: open or closed (MO)
Prefetcher (stride 1) in the memory controller (MP)
Prefetcher (stride 1) in the L2 cache (LP)
Load Miss Predictor in the L1 cache (LMP)
Prefetchers can reduce latency if there is locality of access
If the sparse matrix is highly irregular (inherently or from the implementation), an LMP can avoid the latency of the cache hierarchy
We developed an LMP similar to a branch-prediction structure
Memory Prefetcher (MP)
Added a prefetch buffer to the memory controller
16-entry table with 128-byte cache lines, LRU replacement
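A sketch of the buffer's lookup and replacement logic under the stated parameters (16 entries, 128-byte lines, LRU); the slides do not show the controller logic, so the details here are assumptions:

    #include <stdint.h>

    #define PB_ENTRIES 16
    #define LINE_BYTES 128

    typedef struct {
        uint64_t tag[PB_ENTRIES];        /* line addresses held in the buffer */
        int      valid[PB_ENTRIES];
        uint64_t last_use[PB_ENTRIES];   /* timestamps for LRU replacement */
        uint64_t clock;
    } prefetch_buf;                      /* zero-initialize before use */

    /* On each demand access: report a hit if the line is buffered, else
     * prefetch the next sequential (stride-1) line into the LRU slot. */
    int pb_access(prefetch_buf *pb, uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;
        pb->clock++;
        for (int i = 0; i < PB_ENTRIES; i++) {
            if (pb->valid[i] && pb->tag[i] == line) {
                pb->last_use[i] = pb->clock;
                return 1;   /* hit: DRAM latency hidden */
            }
        }
        int victim = 0;   /* empty slots have last_use == 0, so they fill first */
        for (int i = 1; i < PB_ENTRIES; i++)
            if (pb->last_use[i] < pb->last_use[victim]) victim = i;
        pb->tag[victim]      = line + 1;   /* stride-1: next sequential line */
        pb->valid[victim]    = 1;
        pb->last_use[victim] = pb->clock;
        return 0;   /* miss: full DRAM latency */
    }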
L2 Cache Prefetcher (LP)
Benefits codes with locality of data access but poor data re-use
Memory Page Policy: Open / Closed (MO)
Accesses to open rows have lower latency
Memory control is more complex
Access latencies are not as predictable
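Why open rows are cheaper, as a one-function sketch; the cycle counts are illustrative, not the values used in the simulations:

    /* Row-buffer model behind the open-page policy: a hit to the currently
     * open row skips the precharge/activate steps. */
    int dram_access_latency(int *open_row, int row)
    {
        if (*open_row == row)
            return 20;    /* row-buffer hit: column access only */
        *open_row = row;  /* precharge the old row, activate the new one */
        return 50;        /* precharge + activate + column access */
    }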
Load Miss Predictor
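The slides describe the LMP as similar to a branch-prediction structure; a minimal sketch under that reading (indexing by load PC with 2-bit saturating counters, as in a bimodal branch predictor, is my choice of structure, not specified in the slides):

    #include <stdint.h>

    #define LMP_ENTRIES 1024

    /* 2-bit saturating counters indexed by load PC: a counter >= 2
     * predicts "this load will miss", so the access can be sent to
     * memory early instead of walking the L1/L2/L3 hierarchy. */
    static uint8_t lmp_table[LMP_ENTRIES];   /* zero-initialized: predict hit */

    int lmp_predict_miss(uint64_t load_pc)
    {
        return lmp_table[(load_pc >> 2) % LMP_ENTRIES] >= 2;
    }

    void lmp_update(uint64_t load_pc, int did_miss)
    {
        uint8_t *c = &lmp_table[(load_pc >> 2) % LMP_ENTRIES];
        if (did_miss) { if (*c < 3) (*c)++; }
        else          { if (*c > 0) (*c)--; }
    }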
Experiments
Base (B), wider path (W), memory page policy (MO), memory prefetcher (MP), L2 prefetcher (LP), Load Miss Prediction (LMP)
Base (B) at 1000 MHz
Sparse codes:
SMV-U: no blocking, RCM ordering, 4 matrices
SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
NAS MG benchmark
Full-scale application: driven cavity flow
Metrics: time, power, energy, Ops/J (shown relative to the code at B, 1000 MHz, 4 MB L3 cache)
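SMV-O uses Sparsity-style 2x2 register blocking; a minimal sketch of a 2x2 block CSR (BCSR) kernel, assuming an even dimension and blocks stored row-major with 4 values each (the exact Sparsity-generated code is not shown in the slides):

    /* y = A*x with 2x2 register blocking (BCSR): brow_ptr/bcol_idx index
     * 2x2 blocks; val holds 4 values per block in row-major order. */
    void spmv_bcsr2x2(int nb_rows, const int *brow_ptr, const int *bcol_idx,
                      const double *val, const double *x, double *y)
    {
        for (int ib = 0; ib < nb_rows; ib++) {
            double y0 = 0.0, y1 = 0.0;   /* accumulators held in registers */
            for (int k = brow_ptr[ib]; k < brow_ptr[ib+1]; k++) {
                const double *b = &val[4*k];
                double x0 = x[2*bcol_idx[k]], x1 = x[2*bcol_idx[k]+1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*ib]   = y0;
            y[2*ib+1] = y1;
        }
    }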
Relative Time: All Features, 300 MHz to 1 GHz, 256 KB L3
[Chart; values < 1 are faster than at base]
Relative Time at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally (B, +W, +MO, +MP, +LP, +LMP); time for each code at B set to 1]
Over 40% performance improvement with all features
Without the optimizations, 40% performance degradation
Relative Power at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally (+W, +MO, +MP, +LP, +LMP); power for each code at B set to 1]
Over 66% power saved from DVS (600 MHz) and the smallest cache, with no performance penalty
Relative Energy at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally; energy for each code at B set to 1]
Over 80% improvement with all features
Without the optimizations, 40% savings but with a performance penalty
Ops/J at 600 MHz, Smaller L3
[Chart; X-axis: features added incrementally; Ops/J for each code at B set to 1]
Factor-of-5 improvement in energy efficiency
PxP Results - II
PxP for a real driven cavity flow application with typical complex code/algorithm features
Sayaka Akioka (lead)
Driven Cavity: Relative Time and Energy
[Charts of relative time and energy for feature sets +W, +MO, +MP, +LP, +LMP, All]
With all features, the code is faster by 20% even at 400 MHz, with 60% less power and energy
PxP Results - III
Models to select optimal sets of features subject to performance/power constraints
Detecting phases in an application
Adaptively selecting a feature set for each application phase:
Reduce power subject to a performance constraint
Reduce time subject to a power constraint
Konrad Malkowski (lead)
Optimal Feature Sets
Least-squares fit to derive models of power or time per code as a function of the feature-set combination F:
T(F) ≈ a_1 F_1 + ... + a_N F_N, where F_i is 1 if feature i is enabled and 0 otherwise
Errors of less than 5%
Define the workload, then select the optimal configuration under power constraints
Example: best-time 2-feature set, even workload, < 50% of base power
At 600 MHz: W + LP; at 800 MHz: MO + MP
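Once the linear models are fitted, selecting a configuration is a small search over the 2^N feature subsets. A sketch of that selection step; the coefficient values below are placeholders, not the fitted values from the study:

    #include <stdio.h>

    #define NFEAT 5   /* W, MO, MP, LP, LMP */

    /* Linear models: T(F) = t_base + sum_i t_coef[i]*F_i, similarly for P.
     * Coefficients would come from the least-squares fit per code. */
    static const double t_base = 1.0;
    static const double t_coef[NFEAT] = {-0.08, -0.05, -0.12, -0.10, -0.06};
    static const double p_base = 0.40;
    static const double p_coef[NFEAT] = { 0.02,  0.01,  0.03,  0.02,  0.02};

    int main(void)
    {
        double best_time = 1e30;
        unsigned best_set = 0;
        for (unsigned set = 0; set < (1u << NFEAT); set++) {  /* all 32 subsets */
            double t = t_base, p = p_base;
            for (int i = 0; i < NFEAT; i++)
                if (set & (1u << i)) { t += t_coef[i]; p += p_coef[i]; }
            if (p <= 0.50 && t < best_time) {  /* power cap: <= 50% of base */
                best_time = t;
                best_set  = set;
            }
        }
        printf("best feature set: 0x%02x, modeled time %.2f\n", best_set, best_time);
        return 0;
    }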
S/W Phases & Their H/W Detection
Different S/W phases can benefit from different H/W features
Challenges:
How do known S/W phases correspond to H/W-detectable phases?
What lightweight H/W metric can be used to detect a phase change?
NAS MG: LSQ and 10M-Cycle Window [figure]
NAS MG: LSQ and 100K-Cycle Window [figure]
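Assuming LSQ here denotes load/store queue occupancy sampled over a cycle window (consistent with the SimpleScalar setup), a sketch of lightweight phase-change detection by comparing consecutive window averages against a threshold:

    #include <math.h>

    /* Flag a phase change when the average LSQ occupancy of the current
     * window differs from the previous window's by more than a threshold.
     * Window length and threshold are tuning knobs (the slides contrast
     * 10M-cycle and 100K-cycle windows). */
    typedef struct {
        double prev_avg, sum;
        long   count, window;   /* window = samples per detection window */
        double threshold;
    } phase_detector;

    int phase_sample(phase_detector *pd, double lsq_occupancy)
    {
        pd->sum += lsq_occupancy;
        if (++pd->count < pd->window) return 0;
        double avg = pd->sum / pd->window;
        int changed = fabs(avg - pd->prev_avg) > pd->threshold;
        pd->prev_avg = avg;
        pd->sum = 0.0;
        pd->count = 0;
        return changed;   /* 1 => re-select the H/W feature set */
    }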
MG: Minimize Power under a Time Constraint

Phase       | T constraint | Freq (MHz) | L3 size | Page policy | LP | MP | LMP | T    | P
------------|--------------|------------|---------|-------------|----|----|-----|------|-----
Restriction | 1.2          | 700        | 1 MB    | MO          | -  | -  | -   | 1.20 | 0.29
Interp 1-6  | 1.2          | 700        | 1 MB    | MO          | -  | p  | -   | 1.19 | 0.37
Interp 7    | 1.2          | 400        | 4 MB    | MO          | p  | p  | -   | 1.15 | 0.29
Remainder   | 1.2          | 600        | 1 MB    | MO          | p  | -  | -   | 1.13 | 0.30
Restriction | 1.0          | 700        | 1 MB    | MO          | p  | p  | p   | 0.98 | 0.37
Interp 1-6  | 1.0          | 800        | 2 MB    | MO          | p  | -  | -   | 0.97 | 0.48
Interp 7    | 1.0          | 500        | 1 MB    | MC          | p  | -  | -   | 0.92 | 0.36
Remainder   | 1.0          | 700        | 1 MB    | MC          | p  | -  | -   | 0.97 | 0.35
Restriction | 0.8          | 800        | 1 MB    | MO          | -  | p  | p   | 0.80 | 0.49
Interp 1-6  | 0.8          | 1000       | 2 MB    | MO          | p  | -  | -   | 0.77 | 0.85
Interp 7    | 0.8          | 700        | 1 MB    | MO          | -  | p  | -   | 0.76 | 0.50
All vs. Adaptive (Using LSQ)
[Charts comparing three configurations: min power under a time constraint, min time under a power constraint, and all features on]
PxP Results: MPPs + MPI Codes
Utilizing load imbalance in tree-structured parallel sparse computations for energy savings
Applications run for days/weeks, so 10% of the ideal load per processor amounts to hours/days of slack
Mahmut Kandemir, F. Li, G. Chen
Tree-Based Parallel Sparse Computation
Tree node = dense/sparse data-parallel operations
Tree structure dictates data dependencies
A node depends only on the subtree rooted at that node
Computation in disjoint subtrees can proceed independently
Imbalance (despite the best data mapping) can be 10% of the ideal load per processor
Exploit task-parallelism at lower levels and data-parallelism at higher levels
Represents Barnes-Hut and FMM N-body tree codes, sparse solvers, ...
Example
[Figure: task tree N0-N12 mapped onto processors P0-P6/p0-p8, annotated with weights (computation/communication) and participating processors; the critical path is highlighted, and routing requirements cause link conflicts]
Integrated link/CPU voltage scaling converts imbalance into energy savings without performance penalties (recursive scheme, multiple passes)
Network topology constrains link scaling
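The core of converting imbalance to savings, as a one-step sketch (the scheme in the talk is recursive over the tree with multiple passes; this shows only the per-processor frequency choice):

    /* A processor whose subtree finishes early can run slower and still
     * meet the critical-path deadline: stretching its runtime (proportional
     * to my_work at f_max) out to the critical-path length allows
     * f = f_max * (my_work / critical_path), with no overall slowdown. */
    double scaled_frequency(double f_max, double my_work, double critical_path)
    {
        double f = f_max * (my_work / critical_path);
        return f < f_max ? f : f_max;
    }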
Energy Consumption
Average savings: CPU-VS (27%), LINK-VS (23%), CPU-LINK-VS (40%)
Other Results
Non-uniform cache architectures (NUCA) and CMPs
NUCA configurations for scientific computing
Utilizing a network-on-chip (NoC) with NUCA
Sayaka Akioka (in progress)
Modeling network PxP
TorusSim tool by Sarah Conner
For a single collective communication, link shutdown is possible for 55%-97% of the time
No performance penalty + energy savings
Summary
Substantial single-processor PxP improvements
For kernels, codes, and full applications
Time: 30%-50% faster
Power/energy: 50%-80% lower
Further savings from LSQ-based H/W adaptivity
Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising
Near-ideal conversion of slack to savings
Link shutdown possible for 60%-97% of a collective communication