~ Greetings from WSU CAPPLab ~
|
|
- Hester Harper
- 8 years ago
- Views:
Transcription
1 ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU) Computer Architecture & Parallel Programming Laboratory (CAPPLab) Wichita, Kansas, USA Prepared on: November 21, 2012
2 Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction (Juggling) (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time, please! Dr. Zaman 2
3 Introduction Presenters Dr. Abu Asaduzzaman Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab) (Juggling) Dr. Zaman 3
4 Performance (Single-Core to) Multicore Architecture History of Computing Word computer in 1613 (this is not the beginning) Von Neumann architecture (1945) data/instructions memory Harvard architecture (1944) data memory, instruction memory Single-Core Processors In most modern processors: split CL1 (I1, D1), unified CL2, Intel Pentium 4, AMD Athlon Classic, Popular Programming Languages C, Dr. Zaman 4
5 Performance (Single-Core to) Multicore Architecture Cache not shown Input Process/Store Output Multi-tasking Time sharing (Juggling!) Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 5
6 Single-Core Core Performance A thread is a running process a single core Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 6
7 Performance Major Steps to Execute an Instruction CPU and Memory 1: I.F. (5) W.B. (3) O.F. 16b D7 D0 Data Registers Memory CPU A7 A7 A0 Address Registers PC b (3) O.F. (5) W.B. 24b Start SR : I.E. ALU IR?? Decoder / Control Unit 2: I.D. Dr. Zaman 7
8 Performance Thread 1: Integer (INT) (Pipelining Technique) Thread 1: Integer 4: Integer 1: Instruction Fetch 2: Instruction Decode (3) Operand(s) Fetch Arithmetic Logic Unit (5) Result Write Back Floating Point Dr. Zaman 8
9 Performance Thread 2: Floating Point (FP) (Pipelining Technique) Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Thread 2: Floating Point Floating Point Dr. Zaman 9
10 Performance Threads 1 and 2: INT and FP s (Pipelining Technique) Thread 1: Integer Integer POSSIBLE? Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Thread 2: Floating Point Floating Point Dr. Zaman 10
11 Performance Threads 1 and 3: Integer s Thread 1: Integer Thread 3: Integer POSSIBLE? Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Floating Point Dr. Zaman 11
12 Performance Threads 1 and 3: Integer s (Multicore) Thread 1: Integer Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Core 1 Floating Point POSSIBLE? Thread 3: Integer Integer Instruction Fetch Core 2 Instruction Decode Operand(s) Fetch Dr. Zaman 12 Arithmetic Logic Unit Floating Point Result Write Back
13 Performance Threads 1, 2, 3, and 4: INT & FP s (Multicore) Instruction Fetch Core 1 Instruction Fetch Core 2 Thread 1: Integer Instruction Decode Thread 2: Floating Point Thread 3: Integer Instruction Decode Operand(s) Fetch Operand(s) Fetch Thread 4: Floating Point Dr. Zaman 13 Integer Arithmetic Logic Unit Floating Point Integer Arithmetic Logic Unit Floating Point Result Write Back Result Write Back POSSIBLE?
14 Performance Simultaneous Multithreading (SMT) Thread A running program (or code segment) is a process Process processes / threads Multithreading (IP4 Hyper-threading) Multiple threads running in a single-processor (time sharing) Simultaneous Multithreading (SMT) Multiple threads running in a single-processor at the same time Generating/Managing Multiple Threads OpenMP, Open MPI C Dr. Zaman 14
15 Performance Multicore Architecture Single-Core Processors Multiprocessor and Multicomputer Systems Multiple processors: shared/common memory, local memory Multiple processors: own private memory Multicore Processors Multiple cores on a single chip Working together, sharing resources Multicore Programming Language supports OpenMP, Open MPI C Dr. Zaman 15
16 Performance Parallel/Concurrent Computing Parallel Processing It is not fun! Paying the lunch bill together Friend Before Eating Total Bill Return Tip After Paying A $10 $1 B $10 $25 $5 $2 $1 C $10 $1 Total $30 $2 Total Spent $9 $9 $9 $27 Started with $30; spent $29 ($27 + $2) Where did $1 go? (Juggling!) Dr. Zaman 16
17 Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 2 x 2 Matrix 8 (i.e., 2 * 2^2) multiplications 4 (i.e., 1 * 2^2) additions Dr. Zaman 17
18 Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 3 x 3 Matrix; how many multiplications and additions? 27 (i.e., 3 * 3^2 i.e., 3^3) multiplications 18 (i.e., 2 * 3^2 i.e., (3 1) * 3^2) additions Dr. Zaman 18
19 Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] = 4 x 4 Matrix ; how many multiplications and additions? 64 (i.e., 4^3) multiplications N^3 multiplications 48 (i.e., 3 * 4^2) additions (N 1)N^2 additions Dr. Zaman 19
20 Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication 4 x 4 Matrix 64 (i.e., 4^3) multiplications 48 (i.e., 3 * 4^2) additions A A 1,1 A 1,2 B 2 x 2 Matrix 8 (i.e., 2^3) multiplications 4 (i.e., 1 * 2^2) additions Are we reducing *s/+s? What is the message? Dr. Zaman 20
21 Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] Ultimate Performance Say, we have unlimited 2 x 2 Matrix solvers with 8 MULT Then it takes only 2 * 8 MULT time unit Do we have unlimited solvers/cores? Dr. Zaman 21
22 Ultimate Performance GPGPU/CUDA Technology GPGPU General-Purpose computing on Graphics Processing Units (GPGPU, GPGP or less often GP²U). More for scientific usages. GPU Graphics Processing Units. Mainly for multimedia usages. CUDA CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. Provides GPGPU programing interface. Dr. Zaman 22
23 Ultimate Performance GPGPU/CUDA Technology (Looking back: PCI) PCI (Peripheral Component Interconnect) CPU PCI Peripherals CPU PCI-E Peripherals PCI Express (Peripheral Component Interconnect Express) One CPU, Multiple GPUs (Juggling) GPU CPU GPU GPU Dr. Zaman 23
24 Ultimate Performance GPGPU/CUDA Technology GPU (the chip itself) consists of group of Streaming Multiprocessors (SM) Inside each SM: 32 cores (sharing the same instruction) 64KB shared memory (shared among the 32 cores) 32K 32bit registers 2 warp schedulers (to schedule instructions) 4 special function units (Juggling) Dr. Zaman 24
25 Ultimate Performance GPGPU/CUDA Technology GPU (the chip itself) consists of group of Streaming Multiprocessors (SM) Inside each SM: 32 cores (sharing the same instruction) 64KB shared memory (shared among the 32 cores) 32K 32bit registers 2 warp schedulers (to schedule instructions) 4 special function units (Juggling) Dr. Zaman 25
26 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 1) CPU allocates and copies data to GPU On CUDA API: cudamalloc() cudamemcpy() Dr. Zaman 26
27 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 2) CPU Sends function parameters and instructions to GPU CUDA API: myfunc<<<blocks, Threads>>>(parameters) Dr. Zaman 27
28 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 3) GPU executes instruction as scheduled in warps (Step 4) Results will need to be copied back to Host memory (RAM) using cudamemcpy() Dr. Zaman 28
29 Ultimate Performance GPGPU/CUDA Technology CUDA Threads are grouped into blocks. This is to optimize the use of memory. Instruction sent by host to GPU is called a Kernel. GPU sees a kernel as a grid of blocks of threads. Dr. Zaman 29
30 Ultimate Performance GPGPU/CUDA Technology Each CUDA thread will execute in one core. Depending on memory requirements of a kernel, multiple block may execute on each SM. Each kernel can only be executed by one device (unless programmer s intervention). Multiple kernels may be executed at one time. Dr. Zaman 30
31 Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication Matrices Systems Dr. Zaman 31
32 Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication Execution Time Power Consumption Dr. Zaman 32
33 Ultimate Performance Case Study 2 (data dependent computation without GPU/CUDA) Heat Transfer on 2D Surface Execution Time Power Consumption Dr. Zaman 33
34 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Lightning Strike Protection (LSP) Dr. Zaman 34
35 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, either partially or wholly because of the high strength-to-weight ratio, stiffness, and larger scale manufacturing abilities at any shape. However, the lack of lightning strike protection (LSP) for the composite materials limits their use in many applications. We propose a fast and effective simulation model using NVIDIA general purpose graphics processing unit (GPGPU) and compute unified device architecture (CUDA) technology which is targeted to LSP analysis on composite aircrafts. Dr. Zaman 35
36 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation In many cases like lightning strikes on a composite material, when the charge distribution is not known, the Poisson's Equation can be used to solve any electrostatic problem. Using the Laplacian operator on the electric potential function over a region of the space where the charge density is not zero, the Poisson's Equation is: Dr. Zaman 36
37 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation If the charge density is zero all over the region, the Poison's Equation becomes Laplace's Equation: A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials, under preparation, IEEE SoutheastCon conference 2013, Jacksonville, Florida, April 4-7, Dr. Zaman 37
38 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation Simulation CPU Only CPU/GPU w/o shared memory CPU/GPU with shared memory Dr. Zaman 38
39 Ultimate Performance Case Study 4 (data independent computation with GPU/CUDA) Quantum Computing On going Expecting collaboration with Dr. Kumar, EECS, WSU Other Areas Eco-Biological studies Medical studies More Dr. Zaman 39
40 Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction Juggling (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time! Dr. Zaman 40
41 WSU CAPPLab CAPPLab Computer Architecture & Parallel Programming Laboratory (CAPPLab) Physical location: 245 Jabara Hall URL: Tel: WSU-3927 Key Objectives Lead research in advanced-level computer architecture, highperformance computing, embedded systems, and related fields. Educate advanced-level computer architecture and parallel programming. Dr. Zaman 41
42 WSU CAPPLab Researchers Faculty Members Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU Students Chok M. Yip, MS Student, EECS Dept. Nasrin Sultana, MS Student, EECS Dept. Zachary A. Vickers, BS Student, EECS Dept. Hin Yun Lee, MS in CS, EECS Dept. Others Dr. Ramazan Asmatulu, Asso. Prof., ME, WSU Dr. Preethika Kumar, Asst. Prof., EECS, WSU Dr. Zaman 42
43 WSU CAPPLab Resources Hardware 1: CUDA Server CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Telsa C2075, 14x 32 cores, 6GB GDDR5 memory 2: CUDA PC CPU: Xeon E5506, 3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC) 2 CUDA enabled Windows Workstations/PCs 1 CUDA enabled Laptops More Software As needed Dr. Zaman 43
44 WSU CAPPLab Past/Current Activities WSU became CUDA Teaching Center for Support from NVIDIA Teaching parallel programming Workshop GPGPU/CUDA/C Summer 2012 (10 participants) GPGPU/CUDA/C Summer 2013 Collaborative Research Dr. Ramazan Asmatulu Dr. Preethika Kumar More Dr. Zaman 44
45 WSU CAPPLab Past/Current/Future Activities Research Funding M2SYS-WSU Biometric Cloud Computing Research Project Teaching (Hardware/Financial) supports from NVIDIA Research Funding MURPA, pending, ORA, WSU NSF TUES Type-1, pending, NSF Preparing for NSF, and other external agencies Dr. Zaman 45
46 Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Thank You! Contact: Abu Asaduzzaman Phone:
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
More informationGPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
More informationMulti-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
More informationNext Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationGraphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationProgram Grid and HPC5+ workshop
Program Grid and HPC5+ workshop 24-30, Bahman 1391 Tuesday Wednesday 9.00-9.45 9.45-10.30 Break 11.00-11.45 11.45-12.30 Lunch 14.00-17.00 Workshop Rouhani Karimi MosalmanTabar Karimi G+MMT+K Opening IPM_Grid
More informationEnhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
More informationIntroduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
More informationLBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationCUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
More informationComputer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
More informationNVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
More informationIntroduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationApplications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
More informationGPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationNVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
More informationSeeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
More informationCS 394 Introduction to Computer Architecture Spring 2012
CS 394 Introduction to Computer Architecture Spring 2012 Class Room/Hours: NA (Online course) Lab Room/Hours: NA Instructor: Abu Asaduzzaman (Dr. Zaman) Office Room: 253 Jabara Hall E-mail: Abu.Asaduzzaman@wichita.edu
More informationEnabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
More informationE6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
More informationGenerations of the computer. processors.
. Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations
More informationIntroduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it
t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
More informationGraphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
More informationParallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
More informationHigh Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
More informationANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
More informationA GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
More informationGeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
More informationGPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
More informationCS2101a Foundations of Programming for High Performance Computing
CS2101a Foundations of Programming for High Performance Computing Marc Moreno Maza & Ning Xie University of Western Ontario, London, Ontario (Canada) CS2101 Plan 1 Course Overview 2 Hardware Acceleration
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationHIGH PERFORMANCE CONSULTING COURSE OFFERINGS
Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...
More informationLogical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.
Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how
More informationChapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationOptimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationThe Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
More informationClustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationDiscovering Computers 2011. Living in a Digital World
Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,
More informationEmbedded Systems: map to FPGA, GPU, CPU?
Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware
More informationACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationProgramming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
More informationGPU Architecture. Michael Doggett ATI
GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super
More informationADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
More informationADVANCED COMPUTER ARCHITECTURE
ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: marco.ferretti@unipv.it Web: www.unipv.it/mferretti, eecs.unipv.it 1 Course syllabus and motivations This course covers the
More informationEnabling Technologies for Distributed Computing
Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More informationGraphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
More informationGPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More informationLet s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
More informationChapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World
Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook
More informationArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer
ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable
More informationHigh Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
More informationOverview. CPU Manufacturers. Current Intel and AMD Offerings
Central Processor Units (CPUs) Overview... 1 CPU Manufacturers... 1 Current Intel and AMD Offerings... 1 Evolution of Intel Processors... 3 S-Spec Code... 5 Basic Components of a CPU... 6 The CPU Die and
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationLearning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design
Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components
More informationTurbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
More informationA Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
More informationCPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1
CPU Session 1 Praktikum Parallele Rechnerarchtitekturen Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 Overview Types of Parallelism in Modern Multi-Core CPUs o Multicore
More informationUnit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit
Unit A451: Computer systems and programming Section 2: Computing Hardware 1/5: Central Processing Unit Section Objectives Candidates should be able to: (a) State the purpose of the CPU (b) Understand the
More informationParallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
More informationOptimizing Application Performance with CUDA Profiling Tools
Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory
More informationThe High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
More informationAccelerating Intensity Layer Based Pencil Filter Algorithm using CUDA
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol
More informationCourse Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu
More informationOpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
More information1. If we need to use each thread to calculate one output element of a vector addition, what would
Quiz questions Lecture 2: 1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index: (A) i=threadidx.x
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationAn examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
More informationMapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
More informationPerformance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management
Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management 1 Kiran George, 2 Chien-In Henry Chen 1,Corresponding Author Computer Engineering Program,
More informationCUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
More informationNVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
More informationExperiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
More informationCUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More informationLecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
More informationOn-Demand Supercomputing Multiplies the Possibilities
Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of Wolfram Research, Inc. On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server
More informationWeek 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
More information10- High Performance Compu5ng
10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance
More informationGPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
More information