ST810 Advanced Computing




Lecture 17: Parallel computing part I
Eric B. Laber, Hua Zhou
Department of Statistics, North Carolina State University
Mar 13, 2013

Outline: GPU computing
- Hardware
- GPU computing overview
- Matlab
- R

Hardware: Typical GPUs on current laptops
- E.g., my MacBook Pro has an Intel HD Graphics 4000 (built in with the i7-3720QM CPU) and an NVIDIA GeForce GT 650M
- The NVIDIA GT 650M has 1 GB memory, 524 KB L2 cache, 384 cores @ 0.9 GHz
- Theoretical throughput: 641.3 SP GFLOPS

Hardware: Typical GPUs on current desktops
- My desktop (Dell Alienware) has an NVIDIA GeForce GTX 580
- The GTX 580 has 1.5 GB memory, 786 KB L2 cache, 512 cores @ 1.59 GHz
- Theoretical throughput: 1581 SP GFLOPS
- Release price: $500 (Nov 2010)

Hardware: Typical GPUs on current servers
- The teaching server has 4 x NVIDIA Tesla M2070Q
- Each Tesla M2070Q has 6 GB memory (5.25 GB with ECC), 786 KB L2 cache, 448 cores @ 1.15 GHz
- Theoretical throughput: 4 x 1288 SP GFLOPS or 4 x 512 DP GFLOPS
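
The theoretical throughput figures on the hardware slides can be approximated with a common rule of thumb, which is an assumption here and not something the slides state: peak SP GFLOPS is roughly cores x clock (GHz) x 2, counting one fused multiply-add (2 FLOPs) per core per cycle. The sketch below applies it to the core counts and rounded clock rates quoted above; the function name is hypothetical, and the results need not match the quoted figures, which presumably rest on different exact clock rates.

```python
def peak_sp_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Rule-of-thumb peak: each core retires one fused multiply-add (2 FLOPs) per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Core counts and rounded clock rates as quoted on the slides.
cards = {
    "GT 650M": (384, 0.9),
    "GTX 580": (512, 1.59),
    "Tesla M2070Q": (448, 1.15),
}

for name, (cores, clock) in cards.items():
    print(f"{name}: {peak_sp_gflops(cores, clock):.1f} SP GFLOPS (rule of thumb)")
```

The estimate is only an upper bound on arithmetic rate; it says nothing about memory bandwidth or data transfer.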

Hardware: Graphics Processing Units (GPUs)
- Ubiquitous in today's hardware (PCs, laptops, servers)
- Cost effective for high performance computing
- Rapid growth in recent years
- Our department has at least two GPU servers. Many nodes in NCSU HPC henry2 are equipped with GPUs too

Hardware: GPU vs CPU architecture
- GPUs contain 100s of processing cores on a single chip; several chips can fit in a desktop PC
- Each core carries out the same operations in parallel on different input data: the single program, multiple data (SPMD) paradigm
- Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly
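
The "if one can transfer the data quickly" caveat can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only: the 8 GB/s host-to-device bandwidth and the 1000 GFLOPS throughput are hypothetical round numbers, not measurements, and the function names are made up for this example. It compares a matrix-vector product (about 2n^2 flops on about n^2 values) with a matrix-matrix product (about 2n^3 flops on 3n^2 values).

```python
def transfer_seconds(n_values, bytes_per_value=4, bandwidth_gb_s=8.0):
    """Time to move n_values single-precision numbers (assumed bandwidth)."""
    return n_values * bytes_per_value / (bandwidth_gb_s * 1e9)


def compute_seconds(flops, throughput_gflops=1000.0):
    """Time to execute the given number of flops (assumed throughput)."""
    return flops / (throughput_gflops * 1e9)


n = 4096

# Matrix-vector product: move the matrix, input vector and result; 2*n^2 flops.
mv_transfer = transfer_seconds(n * n + 2 * n)
mv_compute = compute_seconds(2 * n * n)

# Matrix-matrix product: move two inputs and one result; 2*n^3 flops.
mm_transfer = transfer_seconds(3 * n * n)
mm_compute = compute_seconds(2 * n ** 3)

print(f"matrix-vector: transfer {mv_transfer:.2e} s, compute {mv_compute:.2e} s")
print(f"matrix-matrix: transfer {mm_transfer:.2e} s, compute {mm_compute:.2e} s")
```

Under these assumed numbers the matrix-vector product is dominated by transfer time, while the matrix-matrix product does enough arithmetic per transferred value for the computation to dominate.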

Hardware: GPU vs CPU
An analogy taken from Andrew Beam's presentation in ST790 (figure not reproduced)

GPU computing overview: GPGPU - general-purpose GPU computing
My experience:
- Almost always involves (new) algorithm development and/or revamping CPU code
- Research before going for GPGPU (next slide)
- Easier to develop in C/C++ (free compiler), Fortran (compiler $), and Matlab
- Do not reinvent the wheel: use libraries

GPU computing overview: Before using GPUs
0. Frustrated by slow code...
1. Am I using the right algorithm(s)? Go to your ST758 notes or a numerical analysis book. E.g., for massive data (terabytes), an O(n^2) algorithm vs an O(n log n) algorithm means 31710 years vs 27 seconds on a TFLOPS supercomputer
2. Repeat: profile and optimize the original code
3. Can a compiled language or optimized library (MKL, ATLAS) help?
4. Identify the bottleneck routine and research the potential gain on GPU
5. Can my data fit into GPU memory?
6. Can other routines be easily implemented on GPU? Is that necessary?
7. Decide the toolchain: Matlab, CUDA, PGI toolchain, ...
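
The "31710 years vs 27 seconds" figure in step 1 can be checked with a few lines of arithmetic. The sketch below assumes n = 10^12 (reading "terabytes" as about 10^12 data points), one flop per unit of work, a machine sustaining 10^12 flops per second, a 365-day year, and the natural logarithm in n log n; these assumptions are not stated on the slide but they reproduce its numbers.

```python
import math

n = 1e12                          # assumed problem size ("terabytes")
flops_per_second = 1e12           # a 1 TFLOPS machine
seconds_per_year = 365 * 24 * 3600

# O(n^2) work, expressed in years.
quadratic_years = n ** 2 / flops_per_second / seconds_per_year

# O(n log n) work with the natural logarithm, expressed in seconds.
nlogn_seconds = n * math.log(n) / flops_per_second

print(f"O(n^2):     about {quadratic_years:.0f} years")
print(f"O(n log n): about {nlogn_seconds:.1f} seconds")
```

No hardware upgrade closes a gap of this size, which is why the algorithm question comes before the GPU question.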

GPU computing overview: GPGPU development
A few approaches to developing GPGPU code:
- CUDA toolchain provided by NVIDIA: free; C/C++; only for NVIDIA cards
- PGI toolchain (CUDA Fortran): $$$; C/C++, Fortran; only for NVIDIA cards
- OpenCL (Open Computing Language): open standard; specs for cross-platform, parallel programming of modern processors (PCs, servers, handheld/embedded devices); adopted by Intel, AMD, ...
- Use a higher-level language such as Matlab

GPU computing overview: Which card to use? AMD vs NVIDIA
- NVIDIA cards are more widely adopted for GPGPU. E.g., GPU servers in our department and the NCSU henry2 cluster all have NVIDIA GPUs
- NVIDIA has a much richer set of math libraries
- The cross-platform feature of OpenCL is attractive

                 AMD                  NVIDIA
Cards            ATI Radeon           GTX, Tesla
Language         OpenCL               CUDA C/C++, PGI CUDA Fortran
Math libraries   APPML (BLAS, FFT)    cuBLAS, cuFFT, cuSPARSE, cuRAND, CUDA Math, Thrust, ...
Platforms        Linux, Windows       Linux, Windows, MacOS

Matlab: GPU computing in Matlab
Getting started:
- gpuDevice(): query the GPU device
- methods('gpuArray'): list the built-in functions that support GPU

Matlab: 290 built-in functions in Matlab 2012b support GPU

Matlab: Scheme for algorithm development on GPU in Matlab
    % transfer data to GPU and initialize variables
    gx = gpuArray(X);
    gy = gpuArray(Y);
    gbetahat = gpuArray.randn(5, 1);
    ...
    % computation on GPU
    ...
    % transfer result off GPU
    betahat = gather(gbetahat);
Key: minimize memory transfer between host memory and GPU memory

Matlab: Benchmarking
- Always benchmark the bottleneck routine before embarking on GPU
- E.g., to benchmark A\b (solve linear equations) on my desktop: paralleldemo_gpu_backslash() in Matlab 2012b
[Figure: two panels, "Single precision performance" and "Double precision performance", plotting Gigaflops against matrix size for CPU and GPU; Intel i7 960 CPU vs NVIDIA GTX 580]
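
The benchmarking idea (time the bottleneck routine at several problem sizes and report Gigaflops) can be sketched without any GPU. The example below is not the Matlab demo: it is a pure-Python stand-in that times a textbook Gaussian elimination and converts the time to GFLOPS using the leading-order flop count (2/3) n^3 for a dense solve, which is an assumption about how to count work. All function names are hypothetical.

```python
import random
import time


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented copy of [A | b]
    for k in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gflops(n, seconds):
    """Leading-order flop count (2/3) n^3 divided by time, in units of 1e9."""
    return (2.0 / 3.0) * n ** 3 / seconds / 1e9


def benchmark(n, repeats=3, seed=0):
    """Best-of-repeats GFLOPS for solving one random n x n system."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [rng.random() for _ in range(n)]
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        solve(A, b)
        best = min(best, time.perf_counter() - t0)
    return gflops(n, best)


for n in (50, 100):
    print(f"n = {n}: {benchmark(n):.4f} GFLOPS")
```

Taking the best of several repeats reduces timing noise; the same harness shape applies when the timed call is a GPU routine, provided the transfer time is included in what is measured.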

R: GPU computing in R
- Not supported in base R (opportunity? HiPLARM package)
- A few contributed packages in specific application areas: gputools (some data-mining algorithms), cudaBayesreg (fMRI analysis), ...
- Develop in C/C++ or Fortran and call compiled code from R