Using the Intel Xeon Phi (with the Stampede Supercomputer) ISC 13 Tutorial

Using the Intel Xeon Phi (with the Stampede Supercomputer) ISC 13 Tutorial. Bill Barth, Kent Milfeld, Dan Stanzione, Tommy Minyard (Texas Advanced Computing Center); Jim Jeffers (Intel). June 2013, Leipzig, Germany

Acknowledgements. Thanks/kudos to: Sponsor: National Science Foundation. NSF Grant #OCI-1134872, Stampede Award, "Enabling, Enhancing, and Extending Petascale Computing for Science and Engineering"; NSF Grant #OCI-0926574, "Topology-Aware MPI Collectives and Scheduling". Many, many Dell, Intel, and Mellanox engineers. Professor D.K. Panda at OSU (MVAPICH2). All my colleagues at TACC, who saw way too much of each other during the last year. 2

Thanks for Coming. In this tutorial, we will teach you what you need to know to start running your code on the Intel Xeon Phi Co-Processor. For the hands-on portion, we will use the Stampede system at TACC in Austin, Texas, USA, which has 6,880 Xeon Phi cards. 3

Today's Session. Dan (45 minutes): an overview of Intel's Many Core Architecture; overview of Stampede; programming models for the Xeon Phi. Bill (45 minutes): understanding vectorization. Break. Kent (45 minutes): threading models and offload. Hands-on lab (vectorization and OpenMP): 1 hour and whatever is left. 4

Stampede - High Level Overview. Base cluster (Dell/Intel/Mellanox): Intel Sandy Bridge processors; Dell dual-socket nodes w/ 32GB RAM (2GB/core); 6,400 nodes; 56 Gb/s Mellanox FDR InfiniBand interconnect; more than 100,000 cores, 2.2 PF peak performance. Co-processors: Intel Xeon Phi MIC (Many Integrated Core) processors; special release of Knights Corner (61 cores); all MIC cards are on site at TACC, more than 6,000 installed, final installation ongoing for formal summer acceptance; 7+ PF peak performance. Max total concurrency exceeds 500,000 cores and 1.8M threads. Entered production operations on January 7, 2013. 5

Additional Integrated Subsystems. Stampede includes 16 1TB Sandy Bridge shared-memory nodes with dual GPUs. 128 of the compute nodes are also equipped with NVIDIA Kepler K20 GPUs (and MICs, for performance bake-offs). 16 login, data mover, and management servers (batch, subnet manager, provisioning, etc.). Software included for high-throughput computing and remote visualization. Storage subsystem driven by Dell storage nodes: aggregate bandwidth greater than 150GB/s; more than 14PB of capacity; similar partitioning of disk space into multiple Lustre filesystems as previous TACC systems ($HOME, $WORK, and $SCRATCH). 6

Power/Physical. Stampede spans 182 48U cabinets. Power density (after upgrade in 2015) will exceed 40kW per rack. Estimated 2015 peak power is 6.2MW. 7

Stampede Datacenter Features. Thermal energy storage to reduce peak power charges. Hot aisle containment to boost efficiency (and simply provide enough cooling). Total IT power to 9.5MW, total power ~12MW. Expand experiments in mineral oil cooling. 8

New Construction in 2012, Required for Stampede. Texas Advanced Computing Facility, The University of Texas. Construction started Nov. 2011; design development / GMP pricing set. 9

Stampede Footprint. Ranger: 3,000 ft², 0.6 PF, 3 MW. Stampede: 8,000 ft², ~10 PF, 6.5 MW. Machine room expansion added 6.5MW of additional power. 10

Stampede Datacenter February 20th 11

Stampede Datacenter March 22nd 12

Stampede Datacenter May 16th 13

Stampede Datacenter June 20th 14

Stampede Datacenter, ~September 10th 15

Stampede Datacenter, ~September 10th 16

Some utilities are involved 17

Actually, way more utility space than machine space. It turns out the utilities for the datacenter cost more, take more time, and take more space than the computing systems. 18

Innovative Component. One of the goals of the NSF solicitation was to introduce a major new innovative capability component to science and engineering research communities. We proposed the Intel Xeon Phi coprocessor (Many Integrated Core, or MIC): one first-generation Phi installed per host during initial deployment, and a confirmed injection of 1,600 future-generation MICs in 2015 (5+ PF). 19

An Inflection Point (or Two) in High Performance Computing. Relatively boring, but rapidly improving, architectures for the last 15 years. Performance rising much faster than Moore's Law, but power rising faster, and concurrency rising faster than that, with serial performance decreasing. Something had to give. 20

Consider this Example: Homogeneous vs. Heterogeneous. Acquisition costs and power consumption. Homogeneous system: K computer in Japan; 10 PF unaccelerated; rumored $1.2 billion acquisition cost; large power bill and footprint. Heterogeneous system: Stampede at TACC; near 10PF (2 SNB + up to 8 MIC); modest footprint: 8,000 ft², 6.5MW; $27.5 million. 21

Accelerated Computing for Exascale. Exascale systems, predicted for 2018, would have required 500MW on the old curves. Something new was clearly needed. The accelerated computing movement was reborn (this happens periodically, starting with 387 math coprocessors). 22

Key aspects of acceleration. We have lots of transistors: Moore's law is holding; this isn't necessarily the problem. We don't really need lower power per transistor, we need lower power per *operation*. How to do this? NVIDIA GPUs, AMD Fusion, FPGAs, ARM cores. 23

Intel's MIC approach. Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking "why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about large die, simpler circuits, and much more parallelism, in the x86 line. 24

What is MIC? Basic design ideas: Leverage the x86 architecture (a CPU with many cores). Use x86 cores that are simpler, but allow for more compute throughput. Leverage existing x86 programming models. Dedicate much of the silicon to floating-point ops, keep some cache(s). Keep the cache-coherency protocol. Increase floating-point throughput per core. Implement as a separate device. Strip expensive features (out-of-order execution, branch prediction, etc.). Widened SIMD registers for more throughput (512 bit). Fast (GDDR5) memory on card. 25

Xeon Phi MIC. Xeon Phi = first product of Intel's Many Integrated Core (MIC) architecture. Co-processor: PCI Express card; stripped-down Linux operating system; but still a full host, you can log in to the Phi directly and run stuff. Lots of names: Many Integrated Core architecture, aka MIC; Knights Corner (code name); Intel Xeon Phi Co-processor SE10P (product name of the version for Stampede). I should have many more trademark symbols in my slides if you ask Intel. 26

Intel Xeon Phi Chip. 22 nm process. Based on what Intel learned from Larrabee, the SCC, and the TeraFlops Research Chip. 27

MIC Architecture. Many cores on the die. L1 and L2 cache. Bidirectional ring network for L2. Memory and PCIe connection. 28

George Chrysos, Intel, Hot Chips 24 (2012): http://www.slideshare.net/intelxeon/under-the-armor-of-knights-corner-intel-mic-architecture-athotchips-2012 29

George Chrysos, Intel, Hot Chips 24 (2012): http://www.slideshare.net/intelxeon/under-the-armor-of-knights-corner-intel-mic-architecture-athotchips-2012 30

Speeds and Feeds, Stampede SE10P version (your mileage may vary). Processor: ~1.1 GHz, 61 cores, 512-bit wide vector unit, 1.074 TF peak DP. Data cache: L1 32KB/core; L2 512KB/core, 30.5 MB/chip. Memory: 8GB GDDR5 DRAM, 5.5 GT/s, 512-bit*. PCIe: 5.0 GT/s, 16-bit. 31

What we at TACC like about MIC. Intel's MIC is based on x86 technology: x86 cores w/ caches and cache coherency; SIMD instruction set. Programming for MIC is similar to programming for CPUs: familiar languages (C/C++ and Fortran) and familiar parallel programming models (OpenMP & MPI); MPI on the host and on the coprocessor; any code can run on MIC, not just kernels. Optimizing for MIC is similar to optimizing for CPUs: optimize once, run anywhere. Optimizing can be hard, but everything you do to your code should *also* improve performance on current and future regular Intel chips, AMD CPUs, etc. 32

The MIC Promise. Competitive performance: peak and sustained. Familiar programming model. HPC: C/C++, Fortran, and Intel's TBB. Parallel programming: OpenMP, MPI, Pthreads, Cilk Plus, OpenCL. Intel's MKL library (later: third-party libraries). Serial and scripting, etc. (anything a CPU core can do). Easy transition for OpenMP code: pragmas/directives added to offload OMP parallel regions onto MIC (see examples later). Support for MPI: MPI tasks on host, communication through offloading; MPI tasks on MIC, communication through MPI. 33

Programming the Xeon Phi. Some universal truths of Xeon Phi coding: You need lots of threads. Really. Lots. At least 2 per core to have a chance; 3 or 4 per core are often optimal. This is not like old-school hyperthreading: if you don't have 2 threads per core, half your cycles will be wasted. You need to vectorize. This helps on regular processors too, but the effect is much more pronounced on the MIC chips. You can no longer ignore the vectorization report from the compiler. If you can do these 2 things, good things will happen, in most any language. 34
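
To make the two points concrete, here is a minimal sketch (not from the tutorial slides; the array size and compiler flags are illustrative assumptions) of a loop written so OpenMP supplies the threads and the compiler can vectorize the unit-stride body. Building natively with something like "icc -openmp -mmic -vec-report2" should report the inner loop as vectorized.

    /* Minimal sketch: many threads outside, vectorizable loop inside.
     * N is a made-up size; on the Phi you would run with e.g.
     * OMP_NUM_THREADS=244 (61 cores x 4 hardware threads). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)

    int main(void)
    {
        float *a = malloc(N * sizeof(float));
        float *b = malloc(N * sizeof(float));
        float *c = malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* Thread-level parallelism across cores/hardware threads; the
         * simple unit-stride body is what the 512-bit vector unit wants
         * to see, and what the vectorization report will flag. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + 2.0f * b[i];

        printf("c[0] = %.1f using up to %d threads\n",
               c[0], omp_get_max_threads());
        free(a); free(b); free(c);
        return 0;
    }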

MIC Programming Experiences at TACC. Codes port easily: minutes to days, depending mostly on library dependencies. Performance requires real work, while the silicon continues to evolve. Getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect. Scalability is pretty good. Multiple threads per core are *really important*. Getting your code to vectorize is *really important*. 35

Stampede Programming Models. Traditional cluster, or host only: pure MPI and MPI+X (OpenMP, TBB, Cilk+, OpenCL...); ignore the MIC. Offload: MPI on host, offload to Xeon Phi; targeted offload through OpenMP extensions; automatically offload some library routines with MKL. Native Phi: use one Phi and run OpenMP or MPI programs directly. Symmetric: MPI tasks on host and Phi; treat the Phi (mostly) like another host; pure MPI and MPI+X (limited memory: using X is almost mandatory). 36

Programming Intel MIC-based Systems: MPI+Offload. MPI ranks on Intel Xeon processors (only). All messages into/out of processors. Offload models used to accelerate MPI ranks: Intel Cilk Plus, OpenMP*, Intel Threading Building Blocks, Pthreads* within Intel MIC. Homogeneous network of hybrid nodes (diagram: Xeon hosts exchange MPI messages over the network; each host offloads data to its MIC). 37

Programming Intel MIC-based Systems: Many-core Hosted. MPI ranks on Intel MIC (only). All messages into/out of Intel MIC. Intel Cilk Plus, OpenMP*, Intel Threading Building Blocks, Pthreads used directly within MPI processes. Programmed as a homogeneous network of many-core CPUs (diagram: MPI data flows over the network directly to the MIC attached to each Xeon host). 38

Programming Intel MIC-based Systems: Symmetric. MPI ranks on Intel MIC and Intel Xeon processors. Messages to/from any core. Intel Cilk Plus, OpenMP*, Intel Threading Building Blocks, Pthreads* used directly within MPI processes. Programmed as a heterogeneous network of homogeneous nodes (diagram: MPI data flows over the network to both the Xeons and the MICs on each node). 39

Native Execution. Build for Phi with -mmic. Execute on the host (the runtime will automatically detect an executable built for Phi) or ssh to mic0 and run on the Phi. Can safely use all 61 cores; use at least 2 per core, really. Offload programs should certainly stay away from the 61st core, since the offload daemon runs there. 40
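
A tiny sanity-check program (a sketch, not from the slides; the compile line is an assumption based on Intel compiler flags of that era) can confirm you really are running natively on the card: build it with -mmic, ssh to mic0, and run it there.

    /* Hypothetical native-mode check. Build on the host with something
     * like "icc -mmic -openmp check_native.c -o check_native.mic",
     * then ssh mic0 and run it. On an SE10P you would expect to see
     * 61 cores x 4 hardware threads = 244 logical processors. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("logical processors visible: %d\n", omp_get_num_procs());
        printf("default max OpenMP threads: %d\n", omp_get_max_threads());
        return 0;
    }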

Symmetric MPI. Host and Phi can operate symmetrically as MPI targets. High code reuse: MPI and hybrid MPI+X (X = OpenMP, Cilk+, TBB, pthreads). Be careful to balance the workload between big cores and little cores, and to create locality between local host, local Phi, remote hosts, and remote Phis. 41

Symmetric MPI. Typically 1-2 GB per task on the host. Targeting 1-10 MPI tasks per Phi on Stampede, with 6+ threads per MPI task. 42
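
To show what "treat the Phi like another host" means in practice, here is a minimal sketch (launch mechanics such as host files listing mic0 are MPI-stack specific and are assumptions here): the same source is compiled once for the host and once with -mmic, and each rank simply reports where it landed.

    /* Minimal symmetric-MPI sketch: build a host binary and a -mmic
     * binary from this source, launch ranks on both the Xeon hosts and
     * the Phis, and let each rank report its location. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        printf("rank %d of %d is running on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }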

MPI with Offload to Phi. Existing codes using accelerators have already identified regions where offload works well; porting these to OpenMP offload should be straightforward. Automatic offload where MKL kernel routines can be used: xGEMM, etc. 43
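
For the targeted-offload path, the pattern might look like the following sketch using Intel's offload pragmas as shipped with the Knights Corner-era compilers (the exact clause syntax should be checked against your compiler manual, and N is a made-up size); each host MPI rank would push its OpenMP region to the card like this.

    /* Sketch of targeted offload with the Knights Corner-era Intel
     * offload pragmas; treat clause details as assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        float *a = malloc(N * sizeof(float));
        float *b = malloc(N * sizeof(float));
        float *c = malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* Copy a and b to the coprocessor, run the OpenMP region
         * there, and copy c back when the region completes. */
        #pragma offload target(mic) in(a:length(N)) in(b:length(N)) \
                                    out(c:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }

        printf("c[0] = %.1f\n", c[0]);
        free(a); free(b); free(c);
        return 0;
    }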

LBM Example. Lattice Boltzmann Method CFD code. Carlos Rosales, TACC. MPI code with OpenMP. Finding all the right routines to parallelize is critical. 44

Hydrostatic ice sheet modeling. MUMPS solver (called through PETSc). BLAS calls automatically offloaded behind the scenes. PETSc/MUMPS with AO. 45
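
The "behind the scenes" part here is MKL automatic offload (AO). Below is a hedged sketch of what AO looks like from user code: the MKL_MIC_ENABLE=1 environment variable is taken from Intel MKL documentation of that era and should be verified against your MKL release, the matrix size is made up, and AO only kicks in for sufficiently large problems.

    /* MKL automatic offload sketch: the code is an ordinary DGEMM call;
     * enabling AO (e.g. "MKL_MIC_ENABLE=1 ./ao_dgemm", an assumed
     * environment variable) lets the MKL runtime decide to ship the
     * work to the coprocessor with no source changes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mkl.h>

    int main(void)
    {
        const int n = 4096;   /* AO is only worthwhile for large matrices */
        double *A = malloc((size_t)n * n * sizeof(double));
        double *B = malloc((size_t)n * n * sizeof(double));
        double *C = malloc((size_t)n * n * sizeof(double));

        for (long i = 0; i < (long)n * n; i++) {
            A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
        }

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * n);
        free(A); free(B); free(C);
        return 0;
    }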

Will My Code Run on Xeon Phi? Yes, but that's the wrong question. The right questions are: "Will your code run *best* on Phi?" and "Will you get great Phi performance without additional work?" 46

Other options. In this tutorial, we focus mainly on C/C++ and Fortran, with threading through OpenMP. Other options are out there, notably: Intel's Cilk extensions for C/C++; TBB (Threading Building Blocks) for C++; OpenCL support is available, but we have little experience with it so far. It's Linux, you can run whatever you want (e.g., I have run a pthreads code on the Phi). 47

Stampede: How Do Users Use It? As a 2+ PF Xeon-only system (MPI, OpenMP): many users will use it as an extremely powerful Sandy Bridge cluster, and that's OK; they may also use the shared-memory nodes and remote vis. As a 7+ PF MIC-only system (MPI, OpenMP): homogeneous codes can be run entirely on the MICs. As a ~10PF heterogeneous system (MPI, OpenMP): run separate MPI tasks on Xeon vs. MIC; use OpenMP extensions for offload for hybrid programs. 48

Summary: MIC experiences. Early scaling looks good; application porting is fairly straightforward, since the Phi can run native C/C++ and Fortran code. Some optimization work is still required to get at all the available raw performance for a wide variety of applications, but it is working well for some apps. Vectorization on these large many-core devices is key. Affinitization can have a strong impact (positive or negative) on performance. Algorithmic threading performance is also key: if the kernel of interest does not have high scaling efficiency on a standard x86_64 processor (8-16 cores), it will not scale on many-core MIC. Optimization efforts also yield fruit on normal Xeon (in fact, you may want to optimize there first). Questions? Then let's dive in deeper. 49