~ Greetings from WSU CAPPLab ~



~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU) Computer Architecture & Parallel Programming Laboratory (CAPPLab) Wichita, Kansas, USA Prepared on: November 21, 2012

Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction (Juggling) (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time, please! Dr. Zaman 2

Introduction Presenters Dr. Abu Asaduzzaman Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab) (Juggling) http://www.youtube.com/watch?v=pqbla9ku8ze http://www.youtube.com/watch?v=5ayevg1a8_g&feature=related http://www.youtube.com/watch?v=s0d3fk9zhui Dr. Zaman 3

Performance (Single-Core to) Multicore Architecture History of Computing The word "computer" first appeared in 1613 (this is not the beginning). Von Neumann architecture (1945): a single memory for data and instructions. Harvard architecture (1944): separate data memory and instruction memory. Single-Core Processors Most modern processors use a split CL1 (I1, D1) and a unified CL2, e.g., Intel Pentium 4, AMD Athlon Classic. Popular programming languages: C, Dr. Zaman 4

Performance (Single-Core to) Multicore Architecture (Cache not shown.) [Figure: Input -> Process/Store -> Output.] Multi-tasking: time sharing (Juggling!) Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 5

Single-Core Performance A thread is a running process on a single core. Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 6

Performance Major Steps to Execute an Instruction [Figure: 68000 CPU and memory. Eight 32-bit data registers D0-D7, eight 32-bit address registers A0-A7, PC, SR, IR, ALU, decoder/control unit; 16-bit data bus and 24-bit address bus to memory. Steps: 1: I.F. (Instruction Fetch), 2: I.D. (Instruction Decode), 3: O.F. (Operand Fetch), 4: I.E. (Instruction Execution in the ALU), 5: W.B. (Write Back).] Dr. Zaman 7

Performance Thread 1: Integer (INT) (Pipelining Technique) [Figure: five pipeline stages — Instruction Fetch, Instruction Decode, Operand(s) Fetch, Arithmetic Logic Unit (Integer / Floating Point), Result Write Back; Thread 1 flows through the Integer unit.] Dr. Zaman 8

Performance Thread 2: Floating Point (FP) (Pipelining Technique) [Figure: the same five-stage pipeline — Instruction Fetch, Instruction Decode, Operand(s) Fetch, Arithmetic Logic Unit, Result Write Back; Thread 2 flows through the Floating Point unit.] Dr. Zaman 9

Performance Threads 1 and 2: INT and FP (Pipelining Technique) [Figure: Thread 1 (Integer) and Thread 2 (Floating Point) share the pipeline front end but use different execution units.] POSSIBLE? Dr. Zaman 10

Performance Threads 1 and 3: Integer [Figure: Thread 1 and Thread 3 both need the Integer unit of the same single pipeline.] POSSIBLE? Dr. Zaman 11

Performance Threads 1 and 3: Integer (Multicore) [Figure: Core 1 runs Thread 1 and Core 2 runs Thread 3; each core has its own full pipeline (Instruction Fetch, Instruction Decode, Operand(s) Fetch, Arithmetic Logic Unit with Integer and Floating Point units, Result Write Back).] POSSIBLE? Dr. Zaman 12

Performance Threads 1, 2, 3, and 4: INT & FP (Multicore) [Figure: Core 1 runs Thread 1 (Integer) and Thread 2 (Floating Point); Core 2 runs Thread 3 (Integer) and Thread 4 (Floating Point); each core has its own pipeline with both Integer and Floating Point units.] POSSIBLE? Dr. Zaman 13

Performance Simultaneous Multithreading (SMT) Thread A running program (or code segment) is a process; a process contains one or more threads. Multithreading (e.g., Intel Pentium 4 Hyper-Threading) Multiple threads running on a single processor (time sharing). Simultaneous Multithreading (SMT) Multiple threads running on a single processor at the same time. Generating/Managing Multiple Threads OpenMP, Open MPI, C Dr. Zaman 14

Performance Multicore Architecture Single-Core Processors Multiprocessor and Multicomputer Systems Multiprocessors: multiple processors with shared/common memory (plus local memory). Multicomputers: multiple processors, each with its own private memory. Multicore Processors Multiple cores on a single chip, working together and sharing resources. Multicore Programming Language support: OpenMP, Open MPI, C Dr. Zaman 15

Performance Parallel/Concurrent Computing Parallel Processing It is not fun! Paying the lunch bill together:

Friend   Before eating   Total bill   Return   Tip   After paying   Total spent
A        $10                                         $1             $9
B        $10             $25          $5       $2    $1             $9
C        $10                                         $1             $9
Total    $30                                   $2                   $27

Started with $30; spent $29 ($27 + $2). Where did $1 go? (Juggling!) Dr. Zaman 16

Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 2 x 2 Matrix 8 (i.e., 2 * 2^2) multiplications 4 (i.e., 1 * 2^2) additions Dr. Zaman 17

Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 3 x 3 Matrix; how many multiplications and additions? 27 (i.e., 3 * 3^2, i.e., 3^3) multiplications 18 (i.e., 2 * 3^2, i.e., (3 - 1) * 3^2) additions Dr. Zaman 18

Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] 4 x 4 Matrix; how many multiplications and additions? 64 (i.e., 4^3) multiplications; in general, N^3 multiplications. 48 (i.e., 3 * 4^2) additions; in general, (N - 1)N^2 additions. Dr. Zaman 19

Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication 4 x 4 Matrix: 64 (i.e., 4^3) multiplications, 48 (i.e., 3 * 4^2) additions. [Figure: A partitioned into 2 x 2 blocks A1,1, A1,2, …; B likewise.] 2 x 2 Matrix: 8 (i.e., 2^3) multiplications, 4 (i.e., 1 * 2^2) additions. Are we reducing *s/+s? What is the message? Dr. Zaman 20

Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] Say we have unlimited 2 x 2 matrix solvers, each costing 8 MULT; then it takes only 2 * 8 MULT time units. Do we have unlimited solvers/cores? Dr. Zaman 21

Ultimate Performance GPGPU/CUDA Technology GPGPU General-Purpose computing on Graphics Processing Units (GPGPU, GPGP, or less often GP²U); mainly for scientific use. GPU Graphics Processing Units; mainly for multimedia use. CUDA CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA; it provides a GPGPU programming interface. Dr. Zaman 22

Ultimate Performance GPGPU/CUDA Technology (Looking back: PCI) PCI (Peripheral Component Interconnect) and PCI Express (Peripheral Component Interconnect Express). [Figure: CPU connected to peripherals over PCI; CPU connected to peripherals over PCI-E.] One CPU, Multiple GPUs (Juggling) [Figure: one CPU connected to multiple GPUs.] Dr. Zaman 23

Ultimate Performance GPGPU/CUDA Technology The GPU (the chip itself) consists of a group of Streaming Multiprocessors (SMs). Inside each SM: 32 cores (sharing the same instruction), 64KB shared memory (shared among the 32 cores), 32K 32-bit registers, 2 warp schedulers (to schedule instructions), 4 special function units (Juggling) Dr. Zaman 24


Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel on the GPU in 4 steps. (Step 1) CPU allocates GPU memory and copies data to the GPU. CUDA API: cudaMalloc(), cudaMemcpy() Dr. Zaman 26

Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel on the GPU in 4 steps. (Step 2) CPU sends function parameters and instructions to the GPU. CUDA API: myfunc<<<blocks, threads>>>(parameters) Dr. Zaman 27

Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel on the GPU in 4 steps. (Step 3) GPU executes the instructions as scheduled in warps. (Step 4) Results are copied back to host memory (RAM) using cudaMemcpy(). Dr. Zaman 28
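The four steps can be sketched as a minimal CUDA vector addition (an illustrative sketch that requires an NVIDIA GPU and nvcc to run; the kernel and variable names are ours, not from the slides):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Kernel executed on the GPU; each thread adds one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    // Step 1: allocate GPU memory and copy data to the device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel as a grid of blocks of threads.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 3 happens on the GPU; Step 4: copy results back to host RAM.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %f\n", h_c[10]);  // 10 + 20 = 30
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

The launch configuration `<<<(n + 255) / 256, 256>>>` covers the 1024 elements with 4 blocks of 256 threads, which is the grid-of-blocks-of-threads view the GPU takes of a kernel.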

Ultimate Performance GPGPU/CUDA Technology CUDA threads are grouped into blocks; this optimizes the use of memory. The function the host sends to the GPU is called a kernel. The GPU sees a kernel as a grid of blocks of threads. Dr. Zaman 29

Ultimate Performance GPGPU/CUDA Technology Each CUDA thread executes on one core. Depending on the memory requirements of a kernel, multiple blocks may execute on each SM. Each kernel can only be executed by one device (without programmer intervention). Multiple kernels may be executed at one time. Dr. Zaman 30

Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication [Figures: the matrices and systems used.] Dr. Zaman 31

Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication [Figures: execution time and power consumption results.] Dr. Zaman 32

Ultimate Performance Case Study 2 (data dependent computation without GPU/CUDA) Heat Transfer on 2D Surface [Figures: execution time and power consumption results.] Dr. Zaman 33

Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Lightning Strike Protection (LSP) Dr. Zaman 34

Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, partially or wholly, because of their high strength-to-weight ratio, stiffness, and ability to be manufactured at scale in any shape. However, the lack of lightning strike protection (LSP) limits the use of composite materials in many applications. We propose a fast and effective simulation model using NVIDIA general-purpose graphics processing unit (GPGPU) and compute unified device architecture (CUDA) technology, targeted at LSP analysis of composite aircraft. Dr. Zaman 35

Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation In many cases, like a lightning strike on a composite material, when the charge distribution is not known, Poisson's equation can be used to solve the electrostatic problem. Applying the Laplacian operator to the electric potential function over a region of space where the charge density is not zero gives Poisson's equation: Dr. Zaman 36
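The equation itself appears only as an image in the original slide; its standard form, with φ the electric potential, ρ the charge density, and ε the permittivity, is:

```latex
\nabla^2 \varphi = -\frac{\rho}{\varepsilon}
```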

Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation If the charge density is zero all over the region, Poisson's equation becomes Laplace's equation: A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, "Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials," under preparation, IEEE SoutheastCon 2013, Jacksonville, Florida, April 4-7, 2013. Dr. Zaman 37
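As on the previous slide, the equation is an image in the original; the standard form of Laplace's equation is:

```latex
\nabla^2 \varphi = 0
```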

Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation [Figure: simulation results for CPU only, CPU/GPU without shared memory, and CPU/GPU with shared memory.] Dr. Zaman 38

Ultimate Performance Case Study 4 (data independent computation with GPU/CUDA) Quantum Computing Ongoing; expecting collaboration with Dr. Kumar, EECS, WSU. Other Areas Eco-biological studies, medical studies, more. Dr. Zaman 39

Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction Juggling (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time! Dr. Zaman 40

WSU CAPPLab CAPPLab Computer Architecture & Parallel Programming Laboratory (CAPPLab) Physical location: 245 Jabara Hall URL: http://www.cs.wichita.edu/~capplab/ E-mail: capplab@cs.wichita.edu Tel: +1-316-WSU-3927 Key Objectives Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields. Teach advanced-level computer architecture and parallel programming. Dr. Zaman 41

WSU CAPPLab Researchers Faculty Members Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU Students Chok M. Yip, MS Student, EECS Dept. Nasrin Sultana, MS Student, EECS Dept. Zachary A. Vickers, BS Student, EECS Dept. Hin Yun Lee, MS in CS, EECS Dept. Others Dr. Ramazan Asmatulu, Assoc. Prof., ME, WSU Dr. Preethika Kumar, Asst. Prof., EECS, WSU Dr. Zaman 42

WSU CAPPLab Resources Hardware 1: CUDA Server CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory 2: CUDA PC CPU: Xeon E5506, 3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC) 2 CUDA-enabled Windows workstations/PCs 1 CUDA-enabled laptop More Software As needed Dr. Zaman 43

WSU CAPPLab Past/Current Activities WSU became a CUDA Teaching Center for 2012-13 Support from NVIDIA Teaching parallel programming Workshops GPGPU/CUDA/C Summer 2012 (10 participants) GPGPU/CUDA/C Summer 2013 Collaborative Research Dr. Ramazan Asmatulu Dr. Preethika Kumar More Dr. Zaman 44

WSU CAPPLab Past/Current/Future Activities Research Funding M2SYS-WSU Biometric Cloud Computing Research Project Teaching (hardware/financial) support from NVIDIA Research Funding MURPA (pending, ORA, WSU) NSF TUES Type-1 (pending, NSF) Preparing for NSF and other external agencies Dr. Zaman 45

Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Thank You! Contact: Abu Asaduzzaman E-mail: abuasaduzzaman@ieee.org Phone: +1-316-978-5261 http://webs.wichita.edu/aasaduzzaman/ http://www.cs.wichita.edu/~capplab/