Pedraforca: ARM + GPU prototype


1 Pedraforca: ARM + GPU prototype
Filippo Mantovani
Workshop on exascale and PRACE prototypes
Barcelona, 20 May 2014

2 Overview
Goals:
- Test the performance, scalability, and energy efficiency of ARM multicore processors + high-end GPGPU accelerators
- Test scalability to a high number of compute nodes
Budget: BSC
State of the art: Carma; BSC: ARM + mobile Nvidia GPU

3 Prototype architecture: components

4 Prototype architecture: housing
- E4 boxes
- Bull racks

5 Prototype architecture: rack layout and network topology
- 3 bullx 1200 racks
- 78 compute nodes
- 2 login nodes
- 4 36-port InfiniBand switches (MPI)
- 2 50-port GbE switches (storage)


6 Partners + roles
- BSC: PRACE partner
- Bull: System integrator
- E4: Subcontractor of Bull for system integration
- Seco: Boards provider (Q7 + carrier board)
- Nvidia: GP-GPU provider; support for CUDA software stack
- Mellanox: High-speed interconnect provider; support for IB software stack


7 Prototype procurement
Legal process
- Public tender with publicity required
  - Spain: amount >60 KEuro
  - EU: amount >200 KEuro
- Exclusive contract to Bull
Timeline
- Mar: Project start
- 18 Jan: Tender published
- 25 Feb 2013: Proposal deadline
- 28 Mar 2013: Bull proposal accepted
- 22 Apr 2013: Contract signed
- 28 Aug 2013: First node delivered
- Sep: Final installation

8 Prototype deployment
Power supply
- The data center hosting the final installation did not have enough power capacity
- SOLUTION: connected to a second power grid, no issues
Fans
- System is completely air cooled: 2 fans per node, >150 in total!
- No dynamic regulation of the fan speed
- Problem of noise
- Problem of power consumption (~25 W per node)
- SOLUTION: installed manual speed regulators for the fans
Temperature sensor of the computer room
- Prototype installed on top of one of the temperature sensors of the computer room
- False temperature measurements for part of the data center
- SOLUTION: move the sensor
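The fan figures on this slide can be checked with a little arithmetic. The sketch below is only illustrative: it assumes the ~25 W figure is the fan draw per node and that it applies to the 78 compute nodes from the rack layout (login nodes excluded). The function name `fan_totals` is hypothetical.

```python
# Figures quoted on the slides; treating 25 W as fan power per node is an assumption.
COMPUTE_NODES = 78
FANS_PER_NODE = 2
FAN_POWER_PER_NODE_W = 25.0


def fan_totals(nodes=COMPUTE_NODES,
               fans_per_node=FANS_PER_NODE,
               fan_power_per_node_w=FAN_POWER_PER_NODE_W):
    """Return (total number of fans, total fan power in watts)."""
    return nodes * fans_per_node, nodes * fan_power_per_node_w


if __name__ == "__main__":
    fans, power_w = fan_totals()
    print(f"{fans} fans, about {power_w / 1000:.2f} kW spent on cooling fans")
```

Under these assumptions the count is consistent with the ">150 in total" remark, and the fans alone would draw roughly 2 kW at full speed, which is why manual speed regulators were worth installing.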


9 Architectural issues
Coherency protocol within the memory controller (it seems that)
- Ordering of PCIe transactions is not guaranteed between different PCI devices
- Polling does not work
- Requires rewriting the drivers to avoid polling
- SOLUTION: extremely difficult, unless you have strong commitment from the providers
PCIe bandwidth
- CPU: x4 Gen1, 1 GB/s
- GPU: x16 Gen3, ~15 GB/s
- SOLUTION: impossible to overcome with current technology
Memory issues
- 2 GB on the host / 5 GB on the GPU
- SOLUTION: impossible to overcome with current technology
Board management control
- Impossible to monitor/handle the behaviour of the system remotely
- Due to the nature of the hardware (embedded, not HPC)
- SOLUTION: development of a special motherboard (with obvious impact on the price)
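The PCIe mismatch can be reproduced from the standard per-lane signalling rates and line encodings (Gen1: 2.5 GT/s with 8b/10b, Gen3: 8 GT/s with 128b/130b). The sketch below computes theoretical peak bandwidth per direction only; real throughput is lower because of protocol overhead. The helper name `pcie_bandwidth_gb_s` is hypothetical.

```python
# generation -> (transfer rate in GT/s, payload bits, encoded bits)
PCIE_GENERATIONS = {
    1: (2.5, 8, 10),     # 8b/10b encoding
    2: (5.0, 8, 10),     # 8b/10b encoding
    3: (8.0, 128, 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation, lanes):
    """Theoretical peak bandwidth per direction, in GB/s."""
    rate_gt_s, payload_bits, encoded_bits = PCIE_GENERATIONS[generation]
    gbit_per_lane = rate_gt_s * payload_bits / encoded_bits
    return gbit_per_lane / 8 * lanes


if __name__ == "__main__":
    host = pcie_bandwidth_gb_s(1, 4)    # link available on the ARM SoC side
    gpu = pcie_bandwidth_gb_s(3, 16)    # link the GPU could use
    print(f"host link: {host:.2f} GB/s, GPU link: {gpu:.2f} GB/s")
    print(f"the GPU link is about {gpu / host:.1f}x faster than the host link")
    # Time to fill the 5 GB of GPU memory through the host link at peak rate
    print(f"filling 5 GB takes at least {5 / host:.1f} s")
```

With these numbers the GPU could ingest data roughly 16 times faster than the host link can deliver it, which matches the 1 GB/s versus ~15 GB/s figures on the slide.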


10 Prototype evaluation
What works
- Multi-core CPU
- GPU + CUDA support
- GbE interconnection
- IP over InfiniBand (IPoIB)
- Login nodes
- HPC software stack
What doesn't work
- RDMA over InfiniBand, due to lack of Mellanox support for 32-bit platforms
- No full OpenFabrics Enterprise Distribution (OFED) available for ARM, therefore verbs not usable!!!
- GPUDirect!!!
[Node diagram labels: GPU 1, NW 1, NW 2, GPU 2, PCIe switch (x2), MEM (x2), SoC 1, SoC 2]


11 Advances over State Of The Art
Contribution to the development of the HPC software ecosystem on ARM
- Including CUDA support on ARM based architecture
Performance evaluation of ARM based architecture
Test of ARM based platforms on large scale
- Pedraforca has been the first large HPC system based on ARM architecture
Contribution pointing out the limiting factors of current ARM based platforms (last slide of Alex yesterday)


12 Results: benchmarks
CPU STREAM benchmarks
Per-node power consumption
Operations: Copy, Scale, Sum, Triad

Op.    Threads  Perf. [MB/s]  Power [W]  Eff. [%]  E.Eff. [MB/J]
Copy   1        1229          68.8       20.5      17.86
Copy   2        1633          69.2       27.2      23.60
Copy   4        1633          70.3       27.2      23.23
Scale  1        1299          69.0       21.7      18.83
Scale  2        1610          69.4       26.8      23.20
Scale  4        1591          70.5       26.5      22.57
Sum    1        750           68.9       12.5      10.89
Sum    2        1247          69.2       20.8      18.

* Arka desktop node excludes inefficient fans
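The energy-efficiency column of the STREAM results appears to be the measured bandwidth divided by the node power (MB/s divided by W gives MB/J). The sketch below recomputes it for the rows whose values survive in full; the formula is inferred from the numbers rather than stated on the slide, and the function name `energy_efficiency_mb_per_j` is hypothetical.

```python
# (operation, threads, bandwidth in MB/s, power in W, reported E.Eff. in MB/J)
STREAM_ROWS = [
    ("Copy", 1, 1229, 68.8, 17.86),
    ("Copy", 2, 1633, 69.2, 23.60),
    ("Copy", 4, 1633, 70.3, 23.23),
    ("Scale", 1, 1299, 69.0, 18.83),
    ("Scale", 2, 1610, 69.4, 23.20),
    ("Scale", 4, 1591, 70.5, 22.57),
    ("Sum", 1, 750, 68.9, 10.89),
]


def energy_efficiency_mb_per_j(bandwidth_mb_s, power_w):
    """MB moved per joule: (MB/s) / (J/s)."""
    return bandwidth_mb_s / power_w


if __name__ == "__main__":
    for op, threads, bw, power, reported in STREAM_ROWS:
        computed = energy_efficiency_mb_per_j(bw, power)
        print(f"{op:5s} {threads} thread(s): {computed:6.2f} MB/J (slide: {reported:.2f})")
```

Because power barely changes with the thread count (about 69 to 70 W), energy efficiency tracks bandwidth almost directly: going from one to two threads helps, while four threads add power without adding bandwidth.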

13 Results: Lattice Boltzmann on Pedraforca
Fluid dynamics simulations evolving a 2D array of particles (double) interacting with their third neighbours.
This translates into a regular pattern of:
- floating-point computation: collide
- memory accesses: propagate
[Table: for each of Propagate and Collide, columns Power [W], Performance [GB/s], Perf/Power [GB/J], Time per iteration [ms]; rows Pedraforca and Coka*]
* Coka: 2 x Intel SandyBridge 12-core + 2 x Nvidia K20M (idle Intel MIC was removed for power measurement)
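The propagate/collide pattern can be illustrated with a very small lattice Boltzmann loop. The sketch below is not the code benchmarked on Pedraforca: for brevity it swaps in the simpler D2Q9 model (nearest neighbours only, BGK collision, periodic boundaries) instead of a model reaching third neighbours, and every name and parameter in it is an assumption made for illustration.

```python
# D2Q9 lattice: 9 discrete velocities and their weights (assumed model, see lead-in)
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for one lattice site."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def propagate(f, nx, ny):
    """Memory-bound step: copy each population to the site it streams to."""
    out = [[[0.0] * 9 for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for i, (cx, cy) in enumerate(VELOCITIES):
                out[(x + cx) % nx][(y + cy) % ny][i] = f[x][y][i]
    return out


def collide(f, nx, ny, tau=0.6):
    """Compute-bound step: relax every site towards its local equilibrium."""
    for x in range(nx):
        for y in range(ny):
            site = f[x][y]
            rho = sum(site)
            ux = sum(c[0] * p for c, p in zip(VELOCITIES, site)) / rho
            uy = sum(c[1] * p for c, p in zip(VELOCITIES, site)) / rho
            feq = equilibrium(rho, ux, uy)
            f[x][y] = [p - (p - e) / tau for p, e in zip(site, feq)]
    return f


def total_mass(f):
    return sum(sum(site) for column in f for site in column)


def run(nx=8, ny=8, steps=5):
    """Run a few iterations and return the mass before and after."""
    f = [[equilibrium(1.0 + 0.01 * ((x + y) % 3), 0.0, 0.0)
          for y in range(ny)] for x in range(nx)]
    mass_before = total_mass(f)
    for _ in range(steps):
        f = propagate(f, nx, ny)
        f = collide(f, nx, ny)
    return mass_before, total_mass(f)


if __name__ == "__main__":
    before, after = run()
    print(f"mass before: {before:.6f}, after: {after:.6f}")
```

Propagate performs no arithmetic and only moves data, so it is limited by memory bandwidth, while collide does all the floating-point work per site. That split is why the two kernels are measured separately in the results table.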


14 Results: Education
Teaching PATC Course (last Friday): Programming ARM based prototypes

15 Results: Product of E4 Computer Engineering
Arka EK002 twin server

16 Results: Feelings
Like when you are a kid: you put a lot of effort into building a beautiful LEGO construction, but
- You miss a few bricks to finish it (InfiniBand support)
- You do not have friends to play with (underutilized prototype)


17 Lessons learned
OK to stress unbalanced architectures, but not too much
- PCIe: Gen1 vs Gen3
- RAM: 2 GB on host vs 5 GB on device
Strong commitment of all the parties involved is required
- Not only OEMs, but also final providers!
- How to obtain commitment is an open question:
  - Carlo suggested informing providers
  - Radek suggested pushing providers with the power of a community, making the needs of the project appealing/interesting/profitable
Size matters: bigger prototype means bigger risk!!!


18 Conclusions
Large scale cluster with ARM + GPU
- Some delays in deployment
- But the system is truly new
Prototype as network of GPUs: failure due to hw configuration + missing sw support from the providers
System software leadership
- First ARM based HPC system with CUDA
- Benefit to scientists who ported codes to x86 + CUDA GPU
Encourage European industry
- Embedded industry to develop HPC-ready components
- E4 Computer Engineering commercialising the technology in the ARKA series
Good educational platform
