~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director, Wichita State University (WSU) Computer Architecture & Parallel Programming Laboratory (CAPPLab), Wichita, Kansas, USA. Prepared on: November 21, 2012
Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction (Juggling) (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time, please! Dr. Zaman 2
Introduction Presenter: Dr. Abu Asaduzzaman, Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU; Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab) (Juggling) http://www.youtube.com/watch?v=pqbla9ku8ze http://www.youtube.com/watch?v=5ayevg1a8_g&feature=related http://www.youtube.com/watch?v=s0d3fk9zhui
Performance: (Single-Core to) Multicore Architecture. History of computing: the word "computer" appeared in 1613 (and this is not the beginning). Von Neumann architecture (1945): one memory for data and instructions. Harvard architecture (1944): separate data memory and instruction memory. Single-core processors: most modern processors use a split CL1 (I1, D1) and a unified CL2; e.g., Intel Pentium 4, AMD Athlon Classic. Popular programming languages: C, ...
Performance: (Single-Core to) Multicore Architecture. [Figure: Input -> Process/Store -> Output; cache not shown.] Multi-tasking = time sharing (Juggling!). Courtesy: Jernej Barbič, Carnegie Mellon University
Single-Core Performance: A thread is a running process on a single core. Courtesy: Jernej Barbič, Carnegie Mellon University
Performance: Major Steps to Execute an Instruction (68000 CPU and Memory). 1: Instruction Fetch (I.F.); 2: Instruction Decode (I.D.); 3: Operand(s) Fetch (O.F.); 4: Instruction Execute (I.E.); 5: Result Write Back (W.B.). [Figure: 68000 CPU with data registers D0-D7, address registers A0-A7, PC, SR, IR, ALU, and decoder/control unit, connected to memory via a 16-bit data bus and a 24-bit address bus.]
Performance: Thread 1: Integer (INT) (pipelining technique). Pipeline stages: 1: Instruction Fetch -> 2: Instruction Decode -> 3: Operand(s) Fetch -> 4: Arithmetic Logic Unit (integer unit; the floating-point unit sits idle) -> 5: Result Write Back.
Performance: Thread 2: Floating Point (FP) (pipelining technique). Pipeline stages: Instruction Fetch -> Instruction Decode -> Operand(s) Fetch -> Arithmetic Logic Unit (floating-point unit; the integer unit sits idle) -> Result Write Back.
Performance: Threads 1 and 2: INT and FP (pipelining technique). Thread 1 (integer) and Thread 2 (floating point) share the same pipeline front end but use different execution units. POSSIBLE?
Performance: Threads 1 and 3: Integer threads. Thread 1 (integer) and Thread 3 (integer) both need the same integer unit of a single pipeline. POSSIBLE?
Performance: Threads 1 and 3: Integer threads (multicore). Thread 1 (integer) runs on Core 1 and Thread 3 (integer) runs on Core 2; each core has its own Instruction Fetch, Instruction Decode, Operand(s) Fetch, Arithmetic Logic Unit, and Result Write Back stages. POSSIBLE?
Performance: Threads 1, 2, 3, and 4: INT & FP threads (multicore). Core 1 runs Thread 1 (integer) and Thread 2 (floating point); Core 2 runs Thread 3 (integer) and Thread 4 (floating point); each core's pipeline feeds both its integer and its floating-point unit. POSSIBLE?
Performance: Simultaneous Multithreading (SMT). Thread: a running program (or code segment) is a process; a process may contain multiple threads. Multithreading (e.g., Intel P4 Hyper-Threading): multiple threads running on a single processor (time sharing). Simultaneous Multithreading (SMT): multiple threads running on a single processor at the same time. Generating/managing multiple threads: OpenMP, Open MPI, C
Performance: Multicore Architecture. Single-core processors. Multiprocessor and multicomputer systems: multiprocessors have multiple processors with a shared/common memory (plus local memory); multicomputers have multiple processors, each with its own private memory. Multicore processors: multiple cores on a single chip, working together and sharing resources. Multicore programming language support: OpenMP, Open MPI, C
Performance: Parallel/Concurrent Computing. Parallel processing — it is not fun! Paying the lunch bill together: three friends each start with $10 (total $30). The bill is $25; $5 is returned, they leave a $2 tip, and each takes back $1, so each friend spent $9 (total spent $27). Started with $30; "spent" $29 ($27 + $2)? Where did the $1 go? (Juggling!)
Ultimate Performance: Multicore with SMT — is it enough? Example: matrix multiplication [C] = [A][B]. 2 x 2 matrix: 8 (i.e., 2 * 2^2) multiplications, 4 (i.e., 1 * 2^2) additions.
Ultimate Performance: Multicore with SMT — is it enough? Example: matrix multiplication [C] = [A][B]. 3 x 3 matrix; how many multiplications and additions? 27 (i.e., 3 * 3^2, i.e., 3^3) multiplications; 18 (i.e., 2 * 3^2, i.e., (3 - 1) * 3^2) additions.
Ultimate Performance: Algorithm Design Techniques. Example: matrix multiplication [C] = [A][B]. 4 x 4 matrix; how many multiplications and additions? 64 (i.e., 4^3) multiplications — in general, N^3 multiplications; 48 (i.e., 3 * 4^2) additions — in general, (N - 1)N^2 additions.
Ultimate Performance: Algorithm Design Techniques. Example: matrix multiplication. 4 x 4 matrix: 64 (i.e., 4^3) multiplications, 48 (i.e., 3 * 4^2) additions. Partition [A] and [B] into 2 x 2 blocks (A1,1, A1,2, ...); each block product is itself a 2 x 2 matrix multiplication: 8 (i.e., 2^3) multiplications, 4 (i.e., 1 * 2^2) additions. Are we reducing *s/+s? What is the message?
Ultimate Performance: Algorithm Design Techniques. Example: matrix multiplication [C] = [A][B]. Say we have unlimited 2 x 2 matrix solvers, each taking 8 MULT time units; then the whole product takes only 2 * 8 MULT time units. Do we have unlimited solvers/cores?
Ultimate Performance: GPGPU/CUDA Technology. GPGPU: General-Purpose computing on Graphics Processing Units (GPGPU, GPGP, or less often GP²U); more for scientific usage. GPU: Graphics Processing Unit; mainly for multimedia usage. CUDA: CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA; it provides a GPGPU programming interface.
Ultimate Performance: GPGPU/CUDA Technology (looking back: PCI). PCI (Peripheral Component Interconnect): CPU <-> PCI <-> peripherals. PCI Express (Peripheral Component Interconnect Express): CPU <-> PCI-E <-> peripherals. One CPU, multiple GPUs (Juggling): CPU connected to GPU, GPU, GPU.
Ultimate Performance: GPGPU/CUDA Technology. The GPU (the chip itself) consists of a group of Streaming Multiprocessors (SMs). Inside each SM: 32 cores (sharing the same instruction); 64KB shared memory (shared among the 32 cores); 32K 32-bit registers; 2 warp schedulers (to schedule instructions); 4 special function units. (Juggling)
Ultimate Performance: GPGPU/CUDA Technology. The host (CPU) executes a kernel on the GPU in 4 steps. (Step 1) The CPU allocates GPU memory and copies data to the GPU. CUDA API: cudaMalloc(), cudaMemcpy()
Ultimate Performance: GPGPU/CUDA Technology. The host (CPU) executes a kernel on the GPU in 4 steps. (Step 2) The CPU sends function parameters and instructions to the GPU. CUDA API: myFunc<<<blocks, threads>>>(parameters)
Ultimate Performance: GPGPU/CUDA Technology. The host (CPU) executes a kernel on the GPU in 4 steps. (Step 3) The GPU executes the instructions as scheduled in warps. (Step 4) The results are copied back to host memory (RAM) using cudaMemcpy().
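The four steps can be sketched as one small CUDA C program. This is a minimal sketch: the vecAdd kernel and the sizes are illustrative assumptions, while cudaMalloc(), cudaMemcpy(), and the <<<blocks, threads>>> launch are the actual CUDA API; it requires nvcc and an NVIDIA GPU to run.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical kernel: each CUDA thread adds one pair of elements */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    /* Step 1: allocate GPU memory and copy the inputs over */
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Step 2: send parameters and launch the kernel (4 blocks x 256 threads) */
    vecAdd<<<n / 256, 256>>>(d_a, d_b, d_c, n);

    /* Step 3: the GPU executes the kernel, scheduled in warps */

    /* Step 4: copy the results back to host memory (RAM) */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[10] = %.1f\n", h_c[10]);  /* expect 10 + 20 on a CUDA device */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Note how the launch configuration (blocks of threads) mirrors the grid-of-blocks view described on the next slide.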
Ultimate Performance: GPGPU/CUDA Technology. CUDA threads are grouped into blocks; this optimizes the use of memory. The function sent by the host to the GPU is called a kernel. The GPU sees a kernel as a grid of blocks of threads.
Ultimate Performance: GPGPU/CUDA Technology. Each CUDA thread executes on one core. Depending on the memory requirements of a kernel, multiple blocks may execute on each SM. Each kernel is executed by one device (unless the programmer intervenes). Multiple kernels may be executed at one time.
Ultimate Performance: Case Study 1 (data-independent computation without GPU/CUDA). Matrix multiplication: matrices, systems.
Ultimate Performance: Case Study 1 (data-independent computation without GPU/CUDA). Matrix multiplication: execution time, power consumption.
Ultimate Performance: Case Study 2 (data-dependent computation without GPU/CUDA). Heat transfer on a 2D surface: execution time, power consumption.
Ultimate Performance: Case Study 3 (data-dependent computation with GPU/CUDA). Lightning Strike Protection (LSP).
Ultimate Performance: Case Study 3 (data-dependent computation with GPU/CUDA). Fast, Effective LSP Simulation. Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, either partially or wholly, because of their high strength-to-weight ratio, stiffness, and large-scale manufacturability in any shape. However, the lack of lightning strike protection (LSP) for composite materials limits their use in many applications. We propose a fast and effective simulation model using NVIDIA general-purpose graphics processing unit (GPGPU) and compute unified device architecture (CUDA) technology, targeted at LSP analysis of composite aircraft.
Ultimate Performance: Case Study 3 (data-dependent computation with GPU/CUDA). Fast, Effective LSP Simulation. In many cases, such as lightning strikes on a composite material where the charge distribution is not known, Poisson's equation can be used to solve any electrostatic problem. Applying the Laplacian operator to the electric potential function over a region of space where the charge density is not zero, Poisson's equation is: ∇²V = −ρ/ε
Ultimate Performance: Case Study 3 (data-dependent computation with GPU/CUDA). Fast, Effective LSP Simulation. If the charge density is zero over the whole region, Poisson's equation becomes Laplace's equation: ∇²V = 0. A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, "Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials," under preparation, IEEE SoutheastCon 2013, Jacksonville, Florida, April 4-7, 2013.
Ultimate Performance: Case Study 3 (data-dependent computation with GPU/CUDA). Fast, Effective LSP Simulation. Three simulation configurations compared: CPU only; CPU/GPU without shared memory; CPU/GPU with shared memory.
Ultimate Performance: Case Study 4 (data-independent computation with GPU/CUDA). Quantum computing: ongoing; expecting collaboration with Dr. Kumar, EECS, WSU. Other areas: eco-biological studies, medical studies, more.
Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction (Juggling) (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time!
WSU CAPPLab. CAPPLab: Computer Architecture & Parallel Programming Laboratory. Physical location: 245 Jabara Hall. URL: http://www.cs.wichita.edu/~capplab/ E-mail: capplab@cs.wichita.edu Tel: +1-316-WSU-3927. Key objectives: lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields; educate students in advanced-level computer architecture and parallel programming.
WSU CAPPLab Researchers. Faculty members: Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU. Students: Chok M. Yip, MS student, EECS Dept.; Nasrin Sultana, MS student, EECS Dept.; Zachary A. Vickers, BS student, EECS Dept.; Hin Yun Lee, MS in CS, EECS Dept. Others: Dr. Ramazan Asmatulu, Assoc. Prof., ME, WSU; Dr. Preethika Kumar, Asst. Prof., EECS, WSU
WSU CAPPLab Resources. Hardware. 1: CUDA server: CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory. 2: CUDA PC: CPU: Xeon E5506, ... 3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC). Also: 2 CUDA-enabled Windows workstations/PCs, 1 CUDA-enabled laptop, more. Software: as needed.
WSU CAPPLab Past/Current Activities. WSU became a CUDA Teaching Center for 2012-13: support from NVIDIA; teaching parallel programming. Workshops: GPGPU/CUDA/C, Summer 2012 (10 participants); GPGPU/CUDA/C, Summer 2013. Collaborative research: Dr. Ramazan Asmatulu, Dr. Preethika Kumar, more.
WSU CAPPLab Past/Current/Future Activities. Research funding: M2SYS-WSU Biometric Cloud Computing Research Project; teaching (hardware/financial) support from NVIDIA. Pending: MURPA (ORA, WSU); NSF TUES Type-1 (NSF). Preparing proposals for NSF and other external agencies.
Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Thank You! Contact: Abu Asaduzzaman E-mail: abuasaduzzaman@ieee.org Phone: +1-316-978-5261 http://webs.wichita.edu/aasaduzzaman/ http://www.cs.wichita.edu/~capplab/