Power-Aware High-Performance Scientific Computing
1 Power-Aware High-Performance Scientific Computing
Padma Raghavan
Scalable Computing Laboratory, Department of Computer Science and Engineering, The Pennsylvania State University
Supported by NSF STHEC: PxP: Co-Managing Performance x Power
2 Trends: Microprocessor Design & HPC
Microprocessor design
- Gordon Moore, 1965: 2X the transistors every 18 months
- Focus on peak rates; LINPACK benchmarks with dense codes
- Patrick Gelsinger, 2004 DAC keynote: power is the only real limiter
HPC and science through simulation
- High costs of installation and cooling
- A petascale system is infeasible without new low-power designs (Simon, Boku)
- Gap between peak (TOP500) and sustained rates on real workloads
- Petascale instrument vs. desktop supercomputing
- CMPs/multicores and performance, power, and productivity issues
3 Why Sparse Scientific Codes?
- Sparse codes (irregular meshes, matrices, graphs), unlike tuned dense codes, do not operate at peak rates (despite tuning)
- Sparse codes represent scalable formulations for many applications, but:
  - Limited data locality and data re-use
  - Memory- and network-latency bound
  - Load imbalances despite partitioning/re-partitioning
  - Multiple algorithms and implementations with different quality/performance trade-offs
- They present many opportunities for adaptive Q(uality) x P(erformance) x P(ower) tuning
4 Sparse Codes and Data
Example: sparse y = Ax
- Used in many PDE simulations: in explicit codes, in implicit codes with linear system solution, and in data clustering with K-means
- Ordering (RCM) to get locality of access in x
- Data locality and data reuse for elements of x (see the sketch below)
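A minimal sketch of the sparse y = Ax kernel in compressed sparse row (CSR) storage, assuming conventional array names (row_ptr, col_idx, vals) that are ours, not the talk's. It shows why the indexed reads of x are the locality-sensitive part that RCM ordering improves:

```c
/* Sparse y = Ax in CSR form: illustrative sketch only. */
#include <stddef.h>

void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        /* Nonzeros of row i are contiguous: streaming access to
         * vals/col_idx, but irregular (indexed) access to x. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

With RCM, the column indices in each row cluster near the diagonal, so successive x[col_idx[k]] reads tend to fall in the same cache lines, giving the locality and reuse the slide refers to.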
5 This Presentation
- Microprocessor/network architectural optimizations x application features
- PxP results for sparse scientific computing
  - Optimizing CPU + memory for sparse PxP
  - PxP models for adaptive feature selection
  - PxP trends on MPPs with CPU + link scaling
- Summary and conclusions
6 PxP Results - I
Characterizing power reductions and performance improvements for a single node, i.e., CPU + memory
There is locality of data access in many sparse codes when matrices are reordered, the right data structures are used, etc.
Konrad Malkowski (lead)
7 Power-Aware + High-Performance Computing
Power of CMOS chips: P = C * V_dd^2 * f + V_dd * I_leak
Typically, higher performance means higher f with higher transistor counts, up to thermal limits
Tuning power:
- DVS: dynamic voltage and frequency scaling for CPUs
- Drowsy/low-power modes of caches and DRAM memory banks
- ABB: adaptive body biasing, which reduces I_leak
If these low-power knobs are exposed in the ISA, they can be used to control power in applications
If some of the power savings are directed to memory/network optimizations, we can increase performance while lowering power, for PxP reductions in energy (a worked sketch follows)
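A worked sketch of the slide's power model, P = C * V_dd^2 * f + V_dd * I_leak, with illustrative constants (the capacitance, leakage current, and voltage/frequency pairs below are assumptions, not measurements from the study). It shows why DVS attacks the dynamic term roughly cubically when voltage and frequency scale together:

```c
/* CMOS power model from the slide: dynamic term + leakage term. */
#include <stdio.h>

static double cmos_power(double C, double vdd, double f, double i_leak)
{
    return C * vdd * vdd * f + vdd * i_leak;
}

int main(void)
{
    double C = 1.0e-9;    /* effective switched capacitance (F), assumed */
    double i_leak = 0.05; /* leakage current (A), assumed */

    /* DVS lowers f and Vdd together, so the dynamic term falls
     * roughly with the cube of frequency. */
    double p_base = cmos_power(C, 1.2, 1.0e9, i_leak); /* 1 GHz at 1.2 V */
    double p_dvs  = cmos_power(C, 0.9, 6.0e8, i_leak); /* 600 MHz at 0.9 V */
    printf("base: %.3f W  dvs: %.3f W\n", p_base, p_dvs);
    return 0;
}
```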
8 Methodology
Cycle-accurate architectural emulations using SimpleScalar, Wattch, and CACTI
- Emulate a CPU with caches + off-chip DRAM memory, starting from a PowerPC-like core (like a BG/L processor)
- Emulate low-power modes
  - Model DVS by scaling frequency and supply voltage
  - Model low-power cache modes by emulating smaller caches
- Emulate memory subsystem optimizations
  - Extend SimpleScalar/Wattch with structures for optimizations that reduce memory latency
9 Base (B) Architecture
- PowerPC-like, 1 GHz core
- 4 MB SRAM L3 (26-cycle latency)
- 2 KB SRAM L2 (7-cycle latency)
- 32 KB SRAM L1 instruction and data caches (1-cycle latency)
- Memory bus: 64 bits
- Memory size: 256 MB (9 x 256 Mbit x 8-pin DRAM)
10 Architectural Extensions
- Wider memory bus: 128 bits, up from the original 64 (W)
- Memory page policy: open or closed (MO)
- Prefetcher (stride 1) in the memory controller (MP)
- Prefetcher (stride 1) in the L2 cache (LP)
- Load Miss Predictor in the L1 cache (LMP)
Prefetchers can reduce latency if there is locality of access
If the sparse matrix is highly irregular (inherently or from the implementation), an LMP can avoid the latency of the cache hierarchy
Developed an LMP similar to a branch-prediction structure
11 Memory Prefetcher (MP)
- Added a prefetch buffer to the memory controller
- 16-entry table with 128-byte cache lines
- LRU replacement (see the sketch below)
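A hedged sketch of such a prefetch buffer: 16 entries, 128-byte lines, LRU replacement, stride-1 (next-line) prefetch, as the slide describes. The data structure and field names are guesses at the emulated hardware, not the authors' implementation:

```c
#include <stdint.h>
#include <stdbool.h>

#define PF_ENTRIES 16
#define LINE_BYTES 128

typedef struct {
    uint64_t tag;      /* line-aligned address */
    uint64_t last_use; /* timestamp for LRU */
    bool valid;
} pf_entry_t;

static pf_entry_t buf[PF_ENTRIES];
static uint64_t now;

static void pf_fill(uint64_t line_addr)
{
    int victim = 0;
    for (int i = 0; i < PF_ENTRIES; i++) {      /* LRU victim search */
        if (!buf[i].valid) { victim = i; break; }
        if (buf[i].last_use < buf[victim].last_use) victim = i;
    }
    buf[victim] = (pf_entry_t){ line_addr, now, true };
}

/* On each memory access: report hit/miss in the buffer and stage the
 * next sequential line (stride 1) if it is not already buffered. */
bool pf_access(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    bool hit = false, have_next = false;
    now++;
    for (int i = 0; i < PF_ENTRIES; i++) {
        if (!buf[i].valid) continue;
        if (buf[i].tag == line)     { buf[i].last_use = now; hit = true; }
        if (buf[i].tag == line + 1) have_next = true;
    }
    if (!have_next)
        pf_fill(line + 1);
    return hit;
}
```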
12 L2 Cache Prefetcher (LP)
- Benefits codes with locality of data access but poor data re-use
13 Memory Page Policy: Open / Closed (MO)
- Accesses to open rows have lower latency
- Memory control is more complex
- Access latencies are less predictable (see the sketch below)
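A small sketch contrasting the two page policies under assumed DRAM timing parameters (the cycle counts are placeholders): the open policy wins on row-buffer hits but pays precharge plus activate on conflicts, which is where the extra controller complexity and the less predictable latency come from:

```c
#include <stdint.h>

#define T_CAS 10 /* column access, assumed cycles */
#define T_RCD 10 /* activate (row-to-column delay), assumed */
#define T_RP  10 /* precharge, assumed */

static int64_t open_row = -1;

/* Open-page policy: keep the last row in the row buffer. */
int dram_access_open_policy(int64_t row)
{
    if (row == open_row)        /* row-buffer hit: fast */
        return T_CAS;
    open_row = row;             /* conflict: precharge + activate */
    return T_RP + T_RCD + T_CAS;
}

/* Closed-page policy: row is precharged after every access, so the
 * latency is uniform (activate + column access). */
int dram_access_closed_policy(int64_t row)
{
    (void)row;
    return T_RCD + T_CAS;
}
```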
14 Load Miss Predictor
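The talk describes the LMP only as being similar to a branch-prediction structure (slide 10); here is a hedged sketch assuming a PC-indexed table of 2-bit saturating counters, analogous to a bimodal branch predictor. The table size and indexing scheme are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define LMP_ENTRIES 1024 /* assumed table size */

static uint8_t lmp_table[LMP_ENTRIES]; /* 2-bit counters, 0..3 */

static inline unsigned lmp_index(uint64_t load_pc)
{
    return (unsigned)((load_pc >> 2) % LMP_ENTRIES);
}

/* Predict: counter >= 2 means "this load will miss in the caches",
 * so it can issue directly to memory and skip the L2/L3 lookup. */
bool lmp_predict_miss(uint64_t load_pc)
{
    return lmp_table[lmp_index(load_pc)] >= 2;
}

/* Train on the actual outcome, saturating at 0 and 3. */
void lmp_update(uint64_t load_pc, bool missed)
{
    uint8_t *c = &lmp_table[lmp_index(load_pc)];
    if (missed && *c < 3) (*c)++;
    else if (!missed && *c > 0) (*c)--;
}
```

For highly irregular sparse accesses, a load predicted to miss bypasses the cache hierarchy entirely, which is how the LMP avoids its stacked lookup latencies.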
15 Experiments
Features: Base (B), wider path (W), memory page policy (MO), memory prefetcher (MP), L2 prefetcher (LP), Load Miss Prediction (LMP); Base (B) at 1000 MHz
Sparse codes:
- SMV-U: no blocking, RCM ordering, 4 matrices
- SMV-O: Sparsity SMV, 2x2 blocking, RCM ordering, 4 matrices
- NAS MG benchmark
- Full-scale application: driven cavity flow
Metrics: time, power, energy, Ops/J (shown relative to the code at B, 1000 MHz, 4 MB L3 cache)
16 Relative Time: All Features, 300 MHz to 1 GHz, 256 KB L3
- Values < 1 are faster than at base
17 Relative Time at 600 MHz, Smaller L3
- X-axis: features added incrementally (B, +W, +MO, +MP, +LP, +LMP) to include all; time for each code at B set to 1
- Over 40% performance improvement with all features
- Without the optimizations, 40% performance degradation
18 Relative Power at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; power for each code at B set to 1
- Over 66% power saved from DVS (600 MHz) and the smallest cache, with no performance penalty
19 Relative Energy at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; energy for each code at B set to 1
- Over 80% improvement with all features
- Without the optimizations, 40% savings but with a performance penalty
20 Ops/J at 600 MHz, Smaller L3
- X-axis: features added incrementally to include all; Ops/J for each code at B set to 1
- Factor of 5 improvement in energy efficiency
21 PxP Results - II
PxP for a real driven-cavity flow application with typical complex code/algorithm features
Sayaka Akioka (lead)
22 Driven Cavity: Relative Time, Energy
(charts: relative time and energy as features +W, +MO, +MP, +LP, +LMP are added, through "all")
With all features, the code is faster by 20% even at 400 MHz, with 60% less power and energy
23 PxP Results - III
- Models to select optimal sets of features subject to performance/power constraints
- Detecting phases in the application
- Adaptively selecting a feature set for each application phase:
  - Reduce power subject to a performance constraint
  - Reduce time subject to a power constraint
Konrad Malkowski (lead)
24 Optimal Feature Sets
Least-squares fit to derive models of power or time per code as a function of the feature-set combination F:
  T = sum_{i=1}^{N} a_i * F_i, with F_i = 1 if feature i is on, 0 otherwise
- Errors of less than 5%
- Define a workload, then select the optimal configuration under power constraints
Example: best-time 2-feature set, even workload, < 50% of base power
- At 600 MHz: W + LP
- At 800 MHz: MO + MP
(a selection sketch follows)
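A hedged sketch of configuration selection with fitted linear models of this form: enumerate all 2^N feature combinations, predict time and power from per-feature coefficients, and keep the fastest set that meets the power constraint. The coefficient values and the budget below are illustrative placeholders, not the study's fitted numbers:

```c
#include <stdio.h>

#define NFEAT 5 /* W, MO, MP, LP, LMP */

int main(void)
{
    const char *name[NFEAT] = { "W", "MO", "MP", "LP", "LMP" };
    /* assumed per-feature time deltas (relative) and power deltas (W) */
    double t0 = 1.0, t[NFEAT] = { -0.10, -0.05, -0.15, -0.08, -0.12 };
    double p0 = 0.4, p[NFEAT] = {  0.03,  0.01,  0.04,  0.02,  0.02 };
    double p_budget = 0.5; /* assumed power budget */

    int best = -1; double best_t = 1e30;
    for (int s = 0; s < (1 << NFEAT); s++) {      /* all 2^N subsets */
        double T = t0, P = p0;
        for (int i = 0; i < NFEAT; i++)
            if (s & (1 << i)) { T += t[i]; P += p[i]; }
        if (P < p_budget && T < best_t) { best_t = T; best = s; }
    }
    if (best < 0) { printf("no feasible set\n"); return 0; }
    printf("best predicted time %.2f with features:", best_t);
    for (int i = 0; i < NFEAT; i++)
        if (best & (1 << i)) printf(" %s", name[i]);
    printf("\n");
    return 0;
}
```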
25 S/W Phases & Their H/W Detection
Different S/W phases can benefit from different H/W features
Challenges:
- How do known S/W phases correspond to H/W-detectable phases?
- What H/W metric can be used to detect a phase change? (must be lightweight; see the sketch below)
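A hedged sketch of one lightweight detection scheme: sample a single hardware metric once per fixed cycle window (the next two slides use 100K- and 10M-cycle windows) and flag a phase change when the sample deviates from a running average by more than a threshold. The smoothing factor and threshold are assumptions, not the authors' method:

```c
#include <stdbool.h>
#include <math.h>

typedef struct {
    double avg;       /* running average of the metric */
    double threshold; /* relative deviation that signals a new phase */
    bool   primed;
} phase_detector_t;

/* Feed one per-window sample; returns true on a detected phase change. */
bool phase_sample(phase_detector_t *d, double metric)
{
    if (!d->primed) { d->avg = metric; d->primed = true; return false; }
    bool changed = fabs(metric - d->avg) > d->threshold * d->avg;
    /* exponential smoothing keeps the detector cheap and online */
    d->avg = 0.75 * d->avg + 0.25 * metric;
    return changed;
}
```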
26 NAS MG: LSQ with a 10M-cycle window
27 NAS MG: LSQ with a 100K-cycle window
28 MG: Min Power, T Constraint
Per-phase optimal configuration table; columns: Phase, Time, Frequency (MHz), L3 size (MB), Page policy (MO = open, MC = closed), LP, MP, LMP, T, P. Each MG phase (Restriction, Interp, Remainder) is assigned its own frequency, L3 size, page policy, and prefetcher/LMP selection; the numeric entries are not recoverable from the transcription.
29 All vs. Adaptive (Using LSQ)
(chart compares three configurations:)
- Min power, T constraint
- Min time, P constraint
- All features on
30 PxP Results: MPPs + MPI Codes
Utilizing load imbalance in tree-structured parallel sparse computations for energy savings
Apps run for days/weeks, so even a small percentage of the ideal load per processor amounts to hours/days of slack
Mahmut Kandemir, F. Li, G. Chen
31 Tree-Based Parallel Sparse Computation
- Tree node = dense/sparse data-parallel operations
- Tree structure dictates data dependencies
  - A node depends only on the subtree rooted at that node
  - Computation in disjoint subtrees can proceed independently
- Imbalance (despite the best data mapping) can be 10% of the ideal load/processor
- Exploit task parallelism at lower levels and data parallelism at higher levels
- Represents Barnes-Hut and FMM N-body tree codes, sparse solvers, ...
32 Example
(figure: a task tree with per-node weights (computation/communication) and participating-processor sets, mapped onto processors P0-P6, with the critical path marked)
- Routing requirements cause conflicts
- Integrated link/CPU voltage scaling converts imbalance to energy savings without performance penalties (recursive scheme, multiple passes)
- Network topology constrains link scaling (a slack-conversion sketch follows)
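A hedged sketch of the basic slack-to-frequency step behind such a scheme: a processor (or link) whose work finishes before the critical path can run proportionally slower and still finish on time. The talk's actual algorithm is recursive over the task tree with multiple passes; this shows only the per-processor conversion, with assumed frequencies and times:

```c
#include <stdio.h>

/* Scale frequency so busy time stretches to fill the critical path,
 * clamped to an assumed minimum supported frequency. */
double scaled_freq(double f_max, double f_min,
                   double t_busy, double t_critical)
{
    double f = f_max * (t_busy / t_critical);
    return (f < f_min) ? f_min : f;
}

int main(void)
{
    double t_crit = 100.0;                     /* critical-path time (s) */
    double busy[4] = { 100.0, 90.0, 70.0, 55.0 };
    for (int p = 0; p < 4; p++) {
        /* e.g., 70 s of work in a 100 s window -> run at 0.7 * f_max */
        double f = scaled_freq(1.0e9, 3.0e8, busy[p], t_crit);
        printf("P%d: %.0f MHz\n", p, f / 1e6);
    }
    return 0;
}
```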
33 Energy Consumption
Average savings: CPU-VS (27%), LINK-VS (23%), CPU-LINK-VS (40%)
34 Other Results
Non-uniform cache architectures (NUCA) and CMPs
- NUCA configurations for scientific computing
- Utilizing a network-on-chip (NoC) with NUCA
- Sayaka Akioka (in progress)
Modeling network PxP
- TorusSim tool by Sarah Conner
- For a single collective communication, link shutdown is possible for 55%-97% of the time
- No performance penalty, plus energy savings
35 Summary
Substantial single-processor PxP improvements
- For kernels, codes, and full applications
- Time: 30%-50% faster
- Power/energy: 50%-80% lower
- Further savings from LSQ-based H/W adaptivity
Multiprocessor (MPP) PxP scaling trends from CPU-link scaling are promising
- Near-ideal conversion of slack to savings
- Link shutdown possible 60%-97% per collective communication