www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014
Overview
Goals:
- Test the performance, scalability, and energy efficiency of ARM multicore processors + high-end GPGPU accelerators
- Test scalability to a high number of compute nodes
Budget: 700.000 (BSC)
State of the art (2012): Carma cluster @ BSC — ARM + mobile Nvidia GPU
Prototype architecture: components
Prototype architecture: housing
- E4 boxes
- Bull racks
Prototype architecture: rack layout and network topology
- 3 bullx 1200 racks
- 78 compute nodes
- 2 login nodes
- 4 36-port InfiniBand switches (MPI)
- 2 50-port GbE switches (storage)
Partners + roles
- BSC: PRACE partner
- Bull: system integrator
- E4: subcontractor of Bull for system integration
- Seco: boards provider (Q7 + carrier board)
- Nvidia: GPGPU provider; support for the CUDA software stack
- Mellanox: high-speed interconnect provider; support for the IB software stack
Prototype procurement
Legal process:
- Public tender with publicity required (Spain: amount >60 KEuro; EU: amount >200 KEuro)
- Exclusive contract to Bull
Timeline:
- Mar 2012: project start
- 18 Jan 2013: tender published
- 25 Feb 2013: proposal deadline
- 28 Mar 2013: Bull proposal accepted
- 22 Apr 2013: contract signed
- 28 Aug 2013: first node delivered
- Sep 2013: final installation
Prototype deployment
Power supply:
- The data center hosting the final installation did not have enough power capacity
- SOLUTION: connected to a second power grid; no further issues
Fans:
- The system is completely air cooled: 2 fans per node, >150 in total!
- No dynamic regulation of fan speed: a noise problem and a power-consumption problem (~25 W per node)
- SOLUTION: installed manual speed regulators for the fans
Temperature sensor of the computer room:
- The system was installed on top of one of the temperature sensors of the computing room, producing false temperature measurements for part of the data center
- SOLUTION: moved the sensor
Architectural issues
Coherency protocol within the memory controller:
- The ordering of PCIe transactions is (apparently) not guaranteed between different PCI devices, so polling does not work; drivers would need to be rewritten to avoid polling
- SOLUTION: extremely difficult, unless you have strong commitment from the providers
PCIe bandwidth:
- CPU: 4x Gen1, 1 GB/s; GPU: 16x Gen3, ~15 GB/s
- SOLUTION: impossible to overcome with current technology
Memory:
- 2 GB on the host / 5 GB on the GPU
- SOLUTION: impossible to overcome with current technology
Board Management Control:
- Impossible to monitor/handle the behaviour of the system remotely, due to the nature of the hardware (embedded, not HPC)
- SOLUTION: development of a special motherboard (with obvious impact on the price)
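The host/device PCIe mismatch above can be reproduced from the raw link rates. A minimal back-of-the-envelope sketch (not from the slides; the per-generation transfer rates and line-coding efficiencies are the standard PCIe figures):

```python
# Rough theoretical per-direction PCIe bandwidth, illustrating the mismatch on
# Pedraforca: the CPU sits on 4x Gen1 while the GPU supports 16x Gen3.
# Real achievable throughput is lower (protocol overhead is ignored here).

def pcie_bw_gbs(lanes, gen):
    # Per generation: (GT/s per lane, line-coding efficiency)
    rates = {1: (2.5, 8 / 10), 2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}
    gt_per_s, eff = rates[gen]
    # GT/s * efficiency = effective Gb/s per lane; divide by 8 for GB/s
    return lanes * gt_per_s * eff / 8

print(pcie_bw_gbs(4, 1))    # CPU side: 1.0 GB/s
print(pcie_bw_gbs(16, 3))   # GPU side: ~15.75 GB/s
```

This matches the slide's 1 GB/s vs ~15 GB/s: a ~15x imbalance that no software tuning can remove.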
Prototype evaluation
What works:
- Multi-core CPU
- GPU + CUDA support
- GbE interconnection
- IPoIB (IP over InfiniBand)
- Login nodes
- HPC software stack
What doesn't work:
- InfiniBand RDMA, due to the lack of Mellanox support for 32-bit platforms
- No full OpenFabrics Enterprise Distribution (OFED) available for ARM, therefore verbs are not usable!
- GPUDirect
[Block diagram: SoC 1 and SoC 2, each with MEM and a PCIe switch attaching a GPU and a network adapter (NW)]
Advances over the state of the art
- Contribution to the development of the HPC software ecosystem on ARM, including CUDA support on ARM-based architectures
- Performance evaluation of ARM-based architectures
- Test of ARM-based platforms at large scale: Pedraforca has been the first large HPC system based on the ARM architecture
- Contribution pointing out the limiting factors of current ARM-based platforms (last slide of Alex yesterday)
Results: benchmarks
CPU STREAM benchmarks with per-node power consumption:

Op.     Threads   Perf. [MB/s]   Power [W]   Eff. [%]   E.Eff. [MB/J]
Copy    1         1229           68.8        20.5       17.86
        2         1633           69.2        27.2       23.60
        4         1633           70.3        27.2       23.23
Scale   1         1299           69.0        21.7       18.83
        2         1610           69.4        26.8       23.20
        4         1591           70.5        26.5       22.57
Sum     1          750           68.9        12.5       10.89
        2         1247           69.2        20.8       18.02
        4         1281           70.4        21.4       18.20
Triad   1          755           68.9        12.6       10.96
        2         1140           69.3        19.0       16.45
        4         1154           70.4        19.2       16.39

* Arka desktop node; excludes inefficient fans
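The energy-efficiency column is simply bandwidth divided by power, since MB/s divided by W (J/s) yields MB/J. A quick sanity check against the table values:

```python
# Energy efficiency of a bandwidth benchmark: (MB/s) / W = MB/J.
def energy_eff_mb_per_j(perf_mb_s, power_w):
    return perf_mb_s / power_w

# Copy with 1 thread: 1229 MB/s at 68.8 W
print(round(energy_eff_mb_per_j(1229, 68.8), 2))  # 17.86
# Copy with 2 threads: 1633 MB/s at 69.2 W
print(round(energy_eff_mb_per_j(1633, 69.2), 2))  # 23.6
```

Note that going from 2 to 4 threads adds ~1 W of power without improving bandwidth, so energy efficiency peaks at 2 threads.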
Results: Lattice Boltzmann on Pedraforca
Fluid-dynamics simulation evolving a 2D lattice of populations (double precision) interacting with their third neighbours. This translates into a regular pattern of floating-point computation (collide) and memory accesses (propagate).

Propagate:
Machine      Power [W]   Performance [GB/s]   Perf/Power [GB/J]   Time per iteration [ms]
Pedraforca   148         129.57               0.88                41.95
Coka *       233         128.16               0.55                34.85

Collide:
Machine      Power [W]   Performance [GB/s]   Perf/Power [GB/J]   Time per iteration [ms]
Pedraforca   187         383.23               2.05                9.58
Coka *       300         461.28               1.54                9.68

* Coka: 2 x Intel Sandy Bridge 12-core + 2 x Nvidia K20M (idle Intel MIC was removed for power measurement)
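The two access patterns measured above can be sketched as follows. This is a toy illustration, not the benchmark code: the lattice shape, velocity set, and the simplified relaxation rule are all assumptions made for brevity.

```python
import numpy as np

# "Propagate" streams each population to a neighbouring site: pure memory
# movement, no arithmetic. "Collide" updates each site from local data only:
# pure floating-point work. Real LBM uses a proper equilibrium distribution;
# here collide relaxes towards the per-site mean just to show the pattern.

def propagate(f, shifts):
    # f: (npop, nx, ny) populations; shift population i by its lattice velocity
    return np.stack([np.roll(f[i], s, axis=(0, 1)) for i, s in enumerate(shifts)])

def collide(f, omega=1.0):
    # Toy local relaxation; like the real BGK collision it conserves the
    # per-site sum of populations (mass)
    feq = f.mean(axis=0, keepdims=True)
    return f + omega * (feq - f)
```

Propagate touches every value twice (load + store at a shifted address), which is why its figure of merit is memory bandwidth, while collide is compute-bound and reaches a much higher effective rate per watt.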
Results: education
- PATC teaching course (last Friday): "Programming ARM-based prototypes"
Results: product of E4 Computer Engineering
- Arka EK002 twin server
Results: feelings
Like when you are a kid: you put a lot of effort into building a beautiful LEGO construction, but...
- You miss a few bricks to finish it (InfiniBand support)
- You do not have friends to play with (underutilized prototype)
Lessons learned
- It is OK to stress unbalanced architectures, but not too much (PCIe: Gen1 vs Gen3; RAM: 2 GB on host vs 5 GB on device)
- Strong commitment of all the parties involved is required: not only the OEM, but also the final providers!
- How to obtain commitment is an open question: Carlo suggested informing providers; Radek suggested pushing providers with the power of a community, making the needs of the project appealing/interesting/profitable
- Size matters: a bigger prototype means a bigger risk!
Conclusions
- Large-scale cluster with ARM + GPU: some delays in deployment, but the system is truly new
- Prototype as a network of GPUs: failure, due to the hardware configuration + missing software support from the providers
- System software leadership: first ARM-based HPC system with CUDA; benefit to scientists who ported codes to x86 + CUDA GPU
- Encouraging European industry: the embedded industry to develop HPC-ready components; E4 Computer Systems commercialising the technology in the ARKA series
- Good educational platform