Performance Evaluation and Energy Efficiency of HPC Platforms
1 Performance Evaluation and Energy Efficiency of HPC Platforms Based on Intel, AMD and ARM Processors
M. Jarus, S. Varrette, A. Oleksiak and P. Bouvry
Poznań Supercomputing and Networking Center
CSC, University of Luxembourg, Luxembourg
1 / 26
2 Summary
1 Introduction
2 Context & Motivations
3 Experimental Setup
4 Experiments Results
5 Conclusion
3 Introduction
4 Introduction
Why High Performance Computing?
"The country that out-computes will be the one that out-competes." (Council on Competitiveness)
- Accelerate research by accelerating computations: 14.4 GFlops (dual-core i7 1.8GHz) vs. TFlops (291 computing nodes, 2944 cores)
- Increase storage capacity: 2TB (1 disk) vs. 1042TB raw (444 disks)
- Communicate faster: 1 GbE (1 Gb/s) vs. Infiniband QDR (40 Gb/s)
5 Introduction
HPC at the Heart of our Daily Life
Today: research, industry, local collectivities...
Tomorrow: applied research, digital health, nano/bio technology
6 Introduction
HPC Evolution towards Exascale
Major investments since 2012 to build an Exascale platform by 2019: > 1.5 G$ for each leading country (US, EU, Russia, etc.)

                   Today       Exascale projections
Power              6 MW        15 MW       20 MW
#Nodes             18,700      50, ,000
Node concurrency   12          1,000       10,000
Interconnect BW    1.5 GB/s    1 TB/s      2 TB/s
MTTI               Day         Day         Day

=> Max power consumption: 0.1 W per core
7 Introduction
Current Leading Processor Technologies
Top500 Count   Model Example                  max. TDP
   %           Intel Xeon X5650 6C 2.66GHz    85W   14.1W/core
   %           Intel Xeon E C 2.7GHz          130W  16.25W/core
   %           AMD Opteron C Interlagos       115W  7.2W/core
   %           IBM Power BQC 16C 1.6GHz       65W   4.1W/core
Alternative low-power processor architectures are required:
1 GPGPU accelerators (Nvidia Tesla cards / IBM PowerXCell 8i)
2 mobile and embedded devices market (ARM, Intel Atom)
=> Can low-power processors really suit HPC?
8 Context & Motivations
9 Context & Motivations
The Mont Blanc Project
EU project, start: October 2011
Objective: develop an ARM-based Exascale HPC using 15 to 30x less energy
Current status: Tibidabo cluster
- based on NVidia Tegra2 SoC (1 ARM Cortex-A9 @ 1 GHz)
- 8 Q7 boards (1 GbE) per blade; total: 128 nodes (38U)
- interconnect: minimalistic tree network based on 1 GbE switches
- measured performance: 120 MFlops/W
Other state-of-the-art projects: EuroCloud project (Energy-conscious 3D Server-on-Chip for Green Cloud)
10 Context & Motivations
[Low-Power] HPC @ PSNC & UL
Name      Location  Size  #cpus  #RAM   Processor                       max TDP/proc
i7        PSNC      1U           GB     Intel Core i7-3615QE@2.3GHz 8C  45W   5.63W/c
atom64    PSNC      1U    18     36GB   Intel Atom N2600@1.6GHz 2C      3.5W  1.75W/c
amdf      PSNC      1U    18     72GB   AMD Fusion G-T40N@1GHz 2C       9W    4.5W/c
bull-bcs  UL        8U    16     1TB    Intel Xeon E7-4850@2GHz 10C     130W  13W/c
viridis   UL        2U           GB     ARM Cortex-A9 1.1GHz 4C         1.9W  0.48W/c
Objective: compare the performance of cutting-edge high-density HPC platforms: low-power platforms (atom64, amdf and viridis) vs. pure computing-efficient platforms (i7 and bull-bcs)
11 Experimental Setup
12 Experimental Setup
Considered Benchmarks
- Phoronix Test Suite, stressing system-wide components (disk, RAM or CPU): C-ray, Hmmer, Pybench
- CPU performance (single-threaded): Coremark, Fhourstones, Whetstone, Linpack
- MPI performance: OSU Micro-Benchmarks (osu_get_latency & osu_get_bw only)
- HPC performance: High-Performance Linpack (HPL)
  - solves a linear system of order N: A x = b
  - Gaussian elimination with partial pivoting
  - two-dimensional P x Q grid of processes
  - N by N+1 coefficient matrix split in NB x NB blocks
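The GFlops figures reported later follow from HPL's standard operation count. A minimal sketch (illustrative function names, not the HPL source): the nominal flop count for an order-n LU factorization with partial pivoting is 2/3 n^3 + 2 n^2, which is divided by the wall time to give the sustained rate.

```python
# Illustrative sketch (not HPL code): the standard operation count
# HPL uses to convert a measured wall time into a GFlops figure.

def hpl_flops(n):
    """Nominal flop count for solving an order-n dense system A x = b
    via LU factorization with partial pivoting: 2/3 n^3 + 2 n^2."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2

def hpl_gflops(n, wall_seconds):
    """Sustained GFlops credited to an HPL run of problem size n."""
    return hpl_flops(n) / wall_seconds / 1e9
```

For instance, a run at N = 40000 finishing in 1000 s would be credited roughly (2/3)(4e4)^3 / 1e3 / 1e9, about 42.7 GFlops; the 2 N^2 term is negligible at these sizes.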
13 Experimental Setup
Performance Measurements
On a given platform: 100 runs for each benchmark, each with the following operations:
1 t0: [fix the CPU frequency] & start system/power monitoring
2 t0 + s: start the selected benchmark
3 t1: benchmark finishes execution
4 t1 + s: end of monitoring
PpMHz (Performance per MHz): impact of CPU frequency on the final benchmark results
- i7: 1.2GHz to 2.3GHz, 2.31GHz (Turbo Mode)
- atom64: 0.6GHz to 1.6GHz
- amdf: 0.8GHz to 1GHz
- bull-bcs: 1.064GHz to 1.995GHz, 1.996GHz (Turbo Mode)
- viridis: 1.1GHz
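The four-step protocol above can be sketched as follows. This is a hypothetical illustration: the monitor object, the benchmark callable and the settling delay `settle` (the "s" in the slide) are all assumed names, not the authors' tooling.

```python
import time

def measured_run(run_benchmark, monitor, settle=5.0):
    """One benchmark iteration under the monitoring protocol:
    start monitoring at t0, launch the benchmark at t0 + s, and
    keep monitoring for s seconds after it finishes at t1."""
    monitor.start()                    # t0: begin system/power monitoring
    time.sleep(settle)                 # wait until t0 + s
    t_start = time.time()
    result = run_benchmark()           # selected benchmark executes
    elapsed = time.time() - t_start    # t1: benchmark finished
    time.sleep(settle)                 # keep monitoring until t1 + s
    monitor.stop()                     # end of monitoring
    return result, elapsed
```

The leading and trailing delays let the power trace capture the idle baseline on both sides of the run, so the benchmark's own consumption can be isolated.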
14 Experimental Setup
Performance Measurements
PpW (Performance per Watt): raw benchmark result divided by
- (official) the average power draw (W), or
- (better) the energy consumed (J)
Different results are achieved with different CPU frequency values; PpW metrics are presented for the highest frequency value.
Technical details:
- viridis: power measures available only by groups of 4 nodes
- bull-bcs: high latency between measures (slow IPMI), 40s minimum
- atom64: strange sensor readings
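The two PpW variants above reduce to one-line computations; a sketch with illustrative names:

```python
def ppw_avg_power(score, avg_watts):
    """Official variant: raw benchmark score over average power draw (W)."""
    return score / avg_watts

def ppw_energy(score, avg_watts, wall_seconds):
    """Better variant: score per Joule (energy = average power x time).
    Unlike the average-power variant, this also penalises slow runs."""
    return score / (avg_watts * wall_seconds)
```

The distinction matters: a platform drawing half the power but taking three times as long improves on the first metric while worsening on the second, since it consumes 1.5x the energy overall.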
15 Experiments Results
16 Experiments Results
CPU Performances (single-threaded)
[Figure: raw benchmark results (log scale) on CoreMark, Fhourstones, Whetstones and Linpack for Intel Core i7, AMD G-T40N, Atom N2600, Intel Xeon E7 and ARM Cortex-A9]
- Best results are obtained by the Intel Core i7, followed by the Intel Xeon E7
- AMD, ARM and Atom achieve comparable results
17 Experiments Results
OSU MPI Benchmark 3.8 Results
[Figure: OSU One Sided MPI Get Latency Test v3.8: latency (µs, log scale, the LOWER the better) vs. packet size (bits, log scale) for CoolEmAll Atom64, CoolEmAll AMDF, CoolEmAll i7, Viridis ARM and BullX BCS]
18 Experiments Results
OSU MPI Benchmark 3.8 Results
[Figure: OSU One Sided MPI Get Bandwidth Test v3.8: bandwidth (MB/s, log scale, the HIGHER the better) vs. packet size (bits, log scale) for BullX BCS, Viridis ARM, CoolEmAll AMDF, CoolEmAll i7 and CoolEmAll Atom64]
19 Experiments Results
OSU MPI Power Measures (OSU Micro-Benchmark 3.8, 2 nodes per platform)
Platform   Latency test      Bandwidth test
i7         4442 J (111 s)    3093 J (75 s)
amdf       3642 J (102 s)    2816 J (77 s)
atom64     7555 J (275 s)    4806 J (172 s)
viridis    499 J (45 s)      526 J (45 s)
20 Experiments Results
HPL 2.1 Benchmark Results, Single Node Runs
[Figures per platform: performance (GFlops) vs. problem size N for several NB values, and power usage (W) over time for the best run]
- i7: NB in {48, 96, 128, 160}, PxQ = 2x4; best run at N = 41185
- amdf: NB in {96, 128, 160}, PxQ = 1x2; best run at N = 19496, energy 57912 J
- atom64: NB in {48, 64, 96, 112, 128}, PxQ = 2x2; best run at N = 12891
- viridis: NB in {64, 96, 112, 128}, PxQ = 2x2; best run at N = 20711
24 Experiments Results
HPL Power Measures, Full Platform Runs
[Figures per platform: power usage (W) over time during the full-platform HPL run]
- i7: 18 nodes, N = 174733, NB = 96, PxQ = 12x12
- amdf: 16 nodes, N = 77984, NB = 160, PxQ = 4x8
- atom64: 18 nodes, N = 54692, NB = 112, PxQ = 8x9
- bull-bcs: 1 node, N = 87920, NB = 112, PxQ = 10x16
26 Experiments Results
HPL Benchmark Results
Best HPL results, single node (columns: Name, #cpu, R_peak, N, NB, P, Q, Time [s], GFlops, Effic., Energy [J]) for i7, amdf, atom64, bcs and viridis [most numeric cells lost in transcription; viridis energy: 9983 J; bcs efficiency: n/a]
Full platform runs (columns: Name, #nodes, R_peak, N, NB, P, Q, Time [s], GFlops, Effic., Energy [J]) for i7, amdf, atom64, bcs and viridis (12 nodes) [most numeric cells lost in transcription; viridis efficiency: n/a]
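The derived columns in these tables follow from two simple relations: efficiency is the sustained GFlops divided by the theoretical peak R_peak, and energy is average power integrated over the run. A sketch under the simplifying assumption of roughly constant power (names are illustrative):

```python
def hpl_efficiency(sustained_gflops, rpeak_gflops):
    """Fraction of the theoretical peak R_peak achieved by an HPL run."""
    return sustained_gflops / rpeak_gflops

def energy_to_solution(avg_watts, wall_seconds):
    """Energy [J] consumed by a run at roughly constant average power."""
    return avg_watts * wall_seconds
```

In practice the power traces of the previous slides are not constant, so the reported energies integrate the measured samples rather than a single average; the product above is the first-order approximation.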
27 Experiments Results
Performance per MHz
[Figure: PpMHz (log scale) on OSU Lat., OSU Bw., HPL, HPL Full, CoreMark, Fhourstones, Whetstones and Linpack for Intel Core i7, AMD G-T40N, Atom N2600, Intel Xeon E7 and ARM Cortex-A9]
- PpMHz values remain quite constant under varying CPU frequencies
- bull-bcs outperforms in all HPC-oriented tests
28 Experiments Results
Energy-efficiency
[Figure: energy consumed (J, log scale) on OSU Lat., OSU Bw., HPL, HPL Full, C-ray, Hmmer and Pybench for Intel Core i7, AMD G-T40N, Atom N2600, Intel Xeon E7 and ARM Cortex-A9]
- The ARM Cortex-A9 is almost always the most energy-efficient CPU
- The Intel Xeon E7 requires much more energy to execute the same application
29 Conclusion
30 Conclusion
- The path to Exascale requires alternative low-power processor architectures
  - most promising direction: mobile and embedded devices
  - ARM-based HPC cluster prototypes in the Mont Blanc project; Tibidabo cluster: 128 nodes, 38U, 120 MFlops/W
- Here: performance of cutting-edge high-density HPC platforms: CoolEmAll RECS @ PSNC, Boston Viridis & Bull @ UL
- Room for improvement, yet this hardware definitively suits HPC environments
Best obtained results:
Name      Processor          MFlops/W  Green500 Rank*
viridis   ARM Cortex-A9
i7        Intel Core i7
bull-bcs  Intel Xeon E7
atom64    Intel Atom N2600
amdf      AMD Fusion G-T40N
* Based on the November 2012 list
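The Green500 ranking referenced in the table divides sustained HPL performance by average power draw; a minimal sketch of the metric (illustrative names):

```python
def mflops_per_watt(sustained_gflops, avg_watts):
    """Green500 metric: sustained HPL MFlops per Watt of average power."""
    return sustained_gflops * 1000.0 / avg_watts
```

For example, a node sustaining 2 GFlops at 10 W of average draw scores 200 MFlops/W on this metric.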
31 Conclusion
Thank you for your attention
32 Conclusion
CoolEmAll RECS @ PSNC
35 kW, 1U, up to 18 nodes in a single enclosure; 3 enclosure units (3U) at PSNC:
- i7: Intel Core i7 2.3GHz 4C HT, TB, 45W TDP
- atom64: Intel Atom 1.6GHz 2C HT, 3.5W TDP
- amdf: AMD 1.0GHz 2C HT, 9W TDP
33 Conclusion
Boston Viridis @ UL
300W, 2U, 10GbE interconnect, 48 ultra low-power SoCs
ARM Cortex-A9 1.1GHz 4C HT, 1.9W TDP
34 Conclusion
BullX BCS (4x S6030) @ UL
8U, aggregation of 4 BullX S6030 in a single SMP node
4 x 4 Intel Xeon 2GHz 10C HT, TB, 130W TDP
Mississippi State University High Performance Computing Collaboratory Brief Overview Trey Breckenridge Director, HPC Mississippi State University Public university (Land Grant) founded in 1878 Traditional
More informationSR-IOV: Performance Benefits for Virtualized Interconnects!
SR-IOV: Performance Benefits for Virtualized Interconnects! Glenn K. Lockwood! Mahidhar Tatineni! Rick Wagner!! July 15, XSEDE14, Atlanta! Background! High Performance Computing (HPC) reaching beyond traditional
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationIntel Labs at ISSCC 2012. Copyright Intel Corporation 2012
Intel Labs at ISSCC 2012 Copyright Intel Corporation 2012 Intel Labs ISSCC 2012 Highlights 1. Efficient Computing Research: Making the most of every milliwatt to make computing greener and more scalable
More informationEVALUATING NEW ARCHITECTURAL FEATURES OF THE INTEL(R) XEON(R) 7500 PROCESSOR FOR HPC WORKLOADS
Computer Science Vol. 12 2011 Paweł Gepner, David L. Fraser, Michał F. Kowalik, Kazimierz Waćkowski EVALUATING NEW ARCHITECTURAL FEATURES OF THE INTEL(R) XEON(R) 7500 PROCESSOR FOR HPC WORKLOADS In this
More informationRDMA over Ethernet - A Preliminary Study
RDMA over Ethernet - A Preliminary Study Hari Subramoni, Miao Luo, Ping Lai and Dhabaleswar. K. Panda Computer Science & Engineering Department The Ohio State University Outline Introduction Problem Statement
More informationBuilding an energy dashboard. Energy measurement and visualization in current HPC systems
Building an energy dashboard Energy measurement and visualization in current HPC systems Thomas Geenen 1/58 thomas.geenen@surfsara.nl SURFsara The Dutch national HPC center 2H 2014 > 1PFlop GPGPU accelerators
More informationInterconnect Your Future Enabling the Best Datacenter Return on Investment. TOP500 Supercomputers, November 2015
Interconnect Your Future Enabling the Best Datacenter Return on Investment TOP500 Supercomputers, November 2015 InfiniBand FDR and EDR Continue Growth and Leadership The Most Used Interconnect On The TOP500
More informationWhere is Ireland in the Global HPC Arena? and what are we doing there?
Where is Ireland in the Global HPC Arena? and what are we doing there? Dr. Brett Becker Irish Supercomputer List College of Computing Technology Dublin, Ireland Outline The Irish Supercomputer List Ireland
More informationSR-IOV In High Performance Computing
SR-IOV In High Performance Computing Hoot Thompson & Dan Duffy NASA Goddard Space Flight Center Greenbelt, MD 20771 hoot@ptpnow.com daniel.q.duffy@nasa.gov www.nccs.nasa.gov Focus on the research side
More informationKashif Iqbal - PhD Kashif.iqbal@ichec.ie
HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD Kashif.iqbal@ichec.ie ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo
More informationLS DYNA Performance Benchmarks and Profiling. January 2009
LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The
More informationImproved LS-DYNA Performance on Sun Servers
8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms
More informationApplication and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster
Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster Mahidhar Tatineni (mahidhar@sdsc.edu) MVAPICH User Group Meeting August 27, 2014 NSF grants: OCI #0910847 Gordon: A Data
More informationScaling from Workstation to Cluster for Compute-Intensive Applications
Cluster Transition Guide: Scaling from Workstation to Cluster for Compute-Intensive Applications IN THIS GUIDE: The Why: Proven Performance Gains On Cluster Vs. Workstation The What: Recommended Reference
More informationPerformance Evaluation of Amazon EC2 for NASA HPC Applications!
National Aeronautics and Space Administration Performance Evaluation of Amazon EC2 for NASA HPC Applications! Piyush Mehrotra!! J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff,! S. Saini, R. Biswas!
More informationTransforming your IT Infrastructure for Improved ROI. October 2013
1 Transforming your IT Infrastructure for Improved ROI October 2013 Legal Notices This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Software
More informationA Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks
A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks Xiaoyi Lu, Md. Wasi- ur- Rahman, Nusrat Islam, and Dhabaleswar K. (DK) Panda Network- Based Compu2ng Laboratory Department
More informationBenchmarks and Comparisons of Performance for Data Intensive Research
Benchmarks and Comparisons of Performance for Data Intensive Research Saad A. Alowayyed August 23, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract
More informationNetworking Virtualization Using FPGAs
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,
More information