The Intel Parallel Computing Center at the University of Bristol
Simon McIntosh-Smith, Department of Computer Science


Bristol's rich heritage in HPC

The University of Bristol is one of the top HPC institutes in the UK:
- It has a vibrant HPC community of >500 researchers, >10% of all staff
- Invests over £2.5m p.a. in local HPC
- Trains over 100 HPC computer scientists each year (Bristol CS ranked #4 in the UK)

HPC resources in Bristol

Blue Crystal supercomputer:
- £12m invested since 2006
- Amongst the fastest in the UK
- ~10,000 processor cores
- ~250 TFLOPS
- >1 PetaByte of data storage

Bristol is a leader in the use of many-core accelerators:
- Intel Xeon Phi
- Nvidia and AMD GPUs

Intel Parallel Computing Center

Intel chose to invest in the University of Bristol to establish its first Intel Parallel Computing Center (IPCC) in the UK (Feb 2014):

"The University of Bristol combines both a demonstrated ability to innovate and optimize parallel applications using open, industry-standard techniques with a focus on practical education of the next generation of application developers." - Joe Curley, Intel

Bristol IPCC activities

Optimising various HPC codes for Xeon Phi:
- BUDE: molecular docking code
- ROTORSIM: multi-block, multi-grid CFD code
- CloverLeaf/TeaLeaf: hydrodynamics benchmarks
- Lattice Boltzmann, and more

Molecular docking in Bristol

BUDE (Bristol University Docking Engine) is one of the fastest and most accurate molecular docking codes in the world. BUDE is being used to find new drug targets for influenza, malaria, Alzheimer's disease, emphysema, insulin signalling and more.

"High Performance in silico Virtual Drug Screening on Many-Core Processors", S. McIntosh-Smith, J. Price, R.B. Sessions, A.A. Ibarra, IJHPCA 2014

BUDE's algorithm

[slide figure: BUDE's algorithm]

BUDE's conditional behaviour

McIntosh-Smith, S., et al., "Benchmarking Energy Efficiency, Power Costs and Carbon Emissions on Heterogeneous Systems", The Computer Journal, 2012, 55(2): pp. 192-205.

BUDE optimisations for Xeon Phi

[slide figure: BUDE optimisations for Xeon Phi]

More BUDE optimisations for Phi

- Work-group and NDRange sizes are very important (multiples of 16, 240, etc.)
- Lots of input into the Xeon Phi OpenCL driver
- Have seen a 2-3X improvement in the last year
- VTune on Phi is proving very useful

BUDE Xeon Phi results

[chart: sustained TFLOP/s and efficiency - Intel Xeon Phi SE10P: 0.68 TFLOP/s; Intel E5-2687W (x2): 0.35 TFLOP/s; efficiencies of 32% and 44% respectively]

One Xeon Phi SE10P is 1.94X faster than 16 cores of Sandy Bridge at 3.1GHz.

CloverLeaf: a Lagrangian-Eulerian hydrodynamics benchmark

- A bandwidth-limited structured grid code that is part of Sandia's "Mantevo" benchmark suite
- Solves the compressible Euler equations, which describe the conservation of energy, mass and momentum in a system
- These equations are solved on a 2D Cartesian grid with second-order accuracy, using an explicit finite-volume method
- Optimised parallel versions exist in OpenMP, MPI, OpenCL, OpenACC, CUDA and Co-Array Fortran

CloverLeaf optimisations for Xeon Phi

Focused on OpenCL and OpenMP:
- Task granularity is crucial
- Important to get rid of bounds checking
- Memory alignment and access patterns are also very significant
- Many simultaneous memory streams can cause the TLBs to thrash
- Barrier placement is critical - adding barriers can improve performance

VTune can be really useful on Phi

[slide figure: VTune screenshot on Xeon Phi]

CloverLeaf Xeon Phi results

[chart: one Xeon Phi is 1.9X faster than 16 cores of Sandy Bridge at 3.1GHz]

S.N. McIntosh-Smith, M. Boulton, D. Curran & J.R. Price, "On the performance portability of structured grid codes on many-core computer architectures", to appear, International Supercomputing, Leipzig, June 2014.

Summary

- Bristol is a leader in exploiting many-core architectures to deliver cutting-edge HPC
- Xeon Phi can deliver acceleration of 1.5-2.0X for real HPC codes
- There's a very worrying trend: many HPC codes are not evolving fast enough to be ready for many-core
- Implication: these codes will fail to get good performance on Xeon, never mind Xeon Phi!

OpenCL conference in Bristol

IWOCL ("eye-wok-ul"), May 12-13th 2014, Bristol, UK
http://iwocl.org
- Held in an award-winning science museum
- 2 days of technical talks and workshops
