Pedraforca: ARM + GPU prototype




www.bsc.es

Pedraforca: ARM + GPU prototype
Filippo Mantovani
Workshop on exascale and PRACE prototypes
Barcelona, 20 May 2014

Overview
Goals:
- Test the performance, scalability, and energy efficiency of ARM multicore processors combined with high-end GPGPU accelerators
- Test scalability to a high number of compute nodes
Budget: 700.000 (BSC)
State of the art: 2012 - Carma cluster @ BSC: ARM + mobile Nvidia GPU

Prototype architecture: components

Prototype architecture: housing
E4 boxes in Bull racks

Prototype architecture: rack layout and network topology
- 3 bullx 1200 racks
- 78 compute nodes
- 2 login nodes
- 4 36-port InfiniBand switches (MPI)
- 2 50-port GbE switches (storage)

Partners + roles
- BSC: PRACE partner
- Bull: system integrator
- E4: subcontractor of Bull for system integration
- Seco: board provider (Q7 + carrier board)
- Nvidia: GP-GPU provider; support for the CUDA software stack
- Mellanox: high-speed interconnect provider; support for the IB software stack

Prototype procurement
Legal process:
- Public tender with publicity required (Spain: amount >60 KEuro; EU: amount >200 KEuro)
- Exclusive contract to Bull
Timeline:
- Mar 2012 - Project start
- 18 Jan 2013 - Tender published
- 25 Feb 2013 - Proposal deadline
- 28 Mar 2013 - Bull proposal accepted
- 22 Apr 2013 - Contract signed
- 28 Aug 2013 - First node delivered
- Sep 2013 - Final installation

Prototype deployment
Power supply:
- The data center hosting the final installation did not have enough power capacity.
- SOLUTION: connected to a second power grid; no further issues.
Fans:
- The system is completely air cooled: 2 fans per node, more than 150 in total!
- No dynamic regulation of fan speed, causing problems with noise and with power consumption (~25 W per node).
- SOLUTION: installed manual speed regulators for the fans.
Temperature sensor of the computer room:
- The system was installed on top of one of the temperature sensors of the computing room, causing false temperature measurements for part of the data center.
- SOLUTION: moved the sensor.
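The fan overhead is easy to put in perspective. A quick sketch using the node count from the rack-layout slide and the approximate per-node fan draw quoted above:

```python
# Figures from the slides: 78 compute nodes, 2 fans per node,
# ~25 W of fan power per node (approximate).
nodes = 78
fans_per_node = 2
fan_power_per_node_w = 25

total_fans = nodes * fans_per_node
total_fan_power_w = nodes * fan_power_per_node_w

print(total_fans, total_fan_power_w)  # 156 1950
```

So the unregulated fans alone draw close to 2 kW across the cluster, which is why manual speed regulators were worth installing.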

Architectural issues
Coherency protocol within the memory controller:
- (It seems that) the ordering of PCIe transactions is not guaranteed between different PCI devices, so polling does not work; drivers would need to be rewritten to avoid polling.
- SOLUTION: extremely difficult, unless you have strong commitment from the providers.
PCIe bandwidth:
- CPU side: 4x Gen1, 1 GB/s; GPU side: 16x Gen3, ~15 GB/s.
- SOLUTION: impossible to overcome with current technology.
Memory limits:
- 2 GB on the host / 5 GB on the GPU.
- SOLUTION: impossible to overcome with current technology.
Board Management Control:
- Impossible to monitor or handle the behaviour of the system remotely, due to the nature of the hardware (embedded, not HPC).
- SOLUTION: development of a special motherboard (with obvious impact on prices).
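The PCIe imbalance follows directly from lane count and link generation. A back-of-the-envelope check, using the link widths from the slide and the standard per-lane PCIe rates (Gen1: 2.5 GT/s with 8b/10b encoding; Gen3: 8 GT/s with 128b/130b):

```python
# Theoretical per-direction PCIe bandwidth per lane, in GB/s.
def lane_bw_gbs(gt_per_s, payload_bits, total_bits):
    return gt_per_s * payload_bits / total_bits / 8  # 8 bits per byte

host_bw = 4 * lane_bw_gbs(2.5, 8, 10)     # CPU side: x4 Gen1
gpu_bw = 16 * lane_bw_gbs(8.0, 128, 130)  # GPU side: x16 Gen3

print(f"host: {host_bw:.2f} GB/s, GPU: {gpu_bw:.2f} GB/s, "
      f"ratio {gpu_bw / host_bw:.1f}x")
# host: 1.00 GB/s, GPU: 15.75 GB/s, ratio 15.8x
```

The GPU's link is roughly 15x wider than the host's, so any traffic that has to cross the SoC side of the PCIe switch is the bottleneck.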

Prototype evaluation
What works:
- Multi-core CPU
- GPU + CUDA support
- GbE interconnect
- IP over InfiniBand (IPoIB)
- Login nodes
- HPC software stack
What doesn't work:
- InfiniBand RDMA, due to the lack of Mellanox support for 32-bit platforms: no full OpenFabrics Enterprise Distribution (OFED) is available for ARM, therefore verbs are not usable!
- GPUDirect!
[Node diagram: SoC 1 and SoC 2, each with its own memory, connected through PCIe switches to GPU 1 / GPU 2 and network adapters NW 1 / NW 2]

Advances over the state of the art
- Contribution to the development of the HPC software ecosystem on ARM, including CUDA support on ARM-based architectures
- Performance evaluation of ARM-based architectures
- Test of ARM-based platforms at large scale: Pedraforca has been the first large HPC system based on the ARM architecture
- Contribution pointing out the limiting factors of current ARM-based platforms (last slide of Alex yesterday)

Results: benchmarks
CPU STREAM benchmarks and per-node power consumption:

Op.     Threads   Perf. [MB/s]   Power [W]   Eff. [%]   E.Eff. [MB/J]
Copy    1         1229           68.8        20.5       17.86
Copy    2         1633           69.2        27.2       23.60
Copy    4         1633           70.3        27.2       23.23
Scale   1         1299           69.0        21.7       18.83
Scale   2         1610           69.4        26.8       23.20
Scale   4         1591           70.5        26.5       22.57
Sum     1         750            68.9        12.5       10.89
Sum     2         1247           69.2        20.8       18.02
Sum     4         1281           70.4        21.4       18.20
Triad   1         755            68.9        12.6       10.96
Triad   2         1140           69.3        19.0       16.45
Triad   4         1154           70.4        19.2       16.39

* Arka desktop node, excluding the inefficient fans
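The four STREAM operations behind the table are simple memory-bandwidth probes, and the E.Eff. column is just MB/s divided by watts. A minimal NumPy sketch of the kernels and the metric (this stands in for the original C benchmark; the array size is illustrative):

```python
import numpy as np
import time

n = 2_000_000                      # elements per array (illustrative size)
a, b, c = (np.zeros(n) for _ in range(3))
b[:], c[:] = 1.0, 2.0
s = 3.0                            # STREAM scalar

def bandwidth_mb_s(nbytes, seconds):
    return nbytes / seconds / 1e6

t = time.perf_counter()
a[:] = b                           # Copy: 2 arrays touched per element
copy_mb_s = bandwidth_mb_s(2 * n * 8, time.perf_counter() - t)

a[:] = s * b                       # Scale: 2 arrays
a[:] = b + c                       # Sum: 3 arrays
a[:] = b + s * c                   # Triad: 3 arrays

# Energy efficiency as reported in the table: MB/J = (MB/s) / W.
# E.g. the 4-thread Triad row: 1154 MB/s at 70.4 W.
print(round(1154 / 70.4, 2))       # 16.39, matching the table
```

The measured bandwidths (~1.6 GB/s peak) make clear how memory-limited the quad-core ARM host is compared to its attached GPU.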

Results: Lattice Boltzmann on Pedraforca
Fluid dynamics simulation evolving a 2D array of particles (double precision) interacting with their third neighbours. This translates into a regular pattern of floating-point computation (collide) and memory accesses (propagate).

Propagate:
Machine       Power [W]   Performance [GB/s]   Perf/Power [GB/J]   Time per iteration [ms]
Pedraforca    148         129.57               0.88                41.95
Coka *        233         128.16               0.55                34.85

Collide:
Machine       Power [W]   Performance [GB/s]   Perf/Power [GB/J]   Time per iteration [ms]
Pedraforca    187         383.23               2.05                9.58
Coka *        300         461.28               1.54                9.68

* Coka: 2 x Intel SandyBridge 12-core + 2 x Nvidia K20M (the idle Intel MIC was removed for the power measurement)
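The propagate/collide split measured above can be illustrated with a toy lattice Boltzmann skeleton. This is a generic D2Q5-style sketch, not the actual Pedraforca code; the lattice size, weights, and relaxation parameter are illustrative, and the equilibrium is deliberately simplified:

```python
import numpy as np

# Toy D2Q5 lattice: 5 velocities (rest, +x, -x, +y, -y), illustrative weights.
velocities = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
weights = np.array([1/3, 1/6, 1/6, 1/6, 1/6])
omega = 1.0                                   # relaxation parameter (illustrative)
nx = ny = 64
f = np.ones((5, nx, ny)) * weights[:, None, None]

def propagate(f):
    # Memory-bound step: shift each population along its velocity direction.
    return np.stack([np.roll(np.roll(f[i], dx, axis=0), dy, axis=1)
                     for i, (dx, dy) in enumerate(velocities)])

def collide(f):
    # Compute-bound step: relax toward a (simplified) local equilibrium.
    rho = f.sum(axis=0)                       # local density
    feq = weights[:, None, None] * rho
    return f + omega * (feq - f)

f = collide(propagate(f))
print(round(float(f.sum(axis=0).mean()), 6))  # total mass is conserved: 1.0
```

Propagate touches every population twice with no arithmetic, so it is bandwidth-bound (hence measured in GB/s), while collide does the floating-point work; the table shows Pedraforca matching the x86+K20M reference in propagate bandwidth at much lower power.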

Results: education and teaching
PATC course (last Friday): Programming ARM-based prototypes

Results: product of E4 Computer Engineering
Arka EK002 twin server

Results: feelings
Like when you are a kid: you put a lot of effort into building a beautiful LEGO construction, but:
- You are missing a few bricks to finish it (InfiniBand support)
- You have no friends to play with (underutilized prototype)

Lessons learned
- It is OK to stress unbalanced architectures, but not too much (PCIe: Gen1 vs Gen3; RAM: 2 GB on the host vs 5 GB on the device)
- Strong commitment of all the parties involved is required: not only the OEM, but also the final providers! How to obtain that commitment is an open question: Carlo suggested informing providers; Radek suggested pushing providers with the power of a community, making the needs of the project appealing/interesting/profitable
- Size matters: a bigger prototype means a bigger risk!

Conclusions
- Large-scale cluster with ARM + GPU: some delays in deployment, but the system is truly new
- Prototype as a network of GPUs: a failure, due to the hardware configuration and missing software support from the providers
- System software leadership: the first ARM-based HPC system with CUDA, benefiting scientists who had already ported codes to x86 + CUDA GPUs
- Encouraging European industry: pushing the embedded industry to develop HPC-ready components; E4 Computer Systems is commercialising the technology in the ARKA series
- Good educational platform