~ Greetings from WSU CAPPLab ~

Size: px
Start display at page:

Download "~ Greetings from WSU CAPPLab ~"

Transcription

1 ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU) Computer Architecture & Parallel Programming Laboratory (CAPPLab) Wichita, Kansas, USA Prepared on: November 21, 2012

2 Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction (Juggling) (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time, please! Dr. Zaman 2

3 Introduction Presenters Dr. Abu Asaduzzaman Asst. Prof., Elec. Eng. & Computer Sci. Dept., WSU Director, WSU Computer Arch & Parallel Prog Lab (CAPPLab) (Juggling) Dr. Zaman 3

4 Performance (Single-Core to) Multicore Architecture History of Computing Word computer in 1613 (this is not the beginning) Von Neumann architecture (1945) data/instructions memory Harvard architecture (1944) data memory, instruction memory Single-Core Processors In most modern processors: split CL1 (I1, D1), unified CL2, Intel Pentium 4, AMD Athlon Classic, Popular Programming Languages C, Dr. Zaman 4

5 Performance (Single-Core to) Multicore Architecture Cache not shown Input Process/Store Output Multi-tasking Time sharing (Juggling!) Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 5

6 Single-Core Core Performance A thread is a running process a single core Courtesy: Jernej Barbič, Carnegie Mellon University Dr. Zaman 6

7 Performance Major Steps to Execute an Instruction CPU and Memory 1: I.F. (5) W.B. (3) O.F. 16b D7 D0 Data Registers Memory CPU A7 A7 A0 Address Registers PC b (3) O.F. (5) W.B. 24b Start SR : I.E. ALU IR?? Decoder / Control Unit 2: I.D. Dr. Zaman 7

8 Performance Thread 1: Integer (INT) (Pipelining Technique) Thread 1: Integer 4: Integer 1: Instruction Fetch 2: Instruction Decode (3) Operand(s) Fetch Arithmetic Logic Unit (5) Result Write Back Floating Point Dr. Zaman 8

9 Performance Thread 2: Floating Point (FP) (Pipelining Technique) Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Thread 2: Floating Point Floating Point Dr. Zaman 9

10 Performance Threads 1 and 2: INT and FP s (Pipelining Technique) Thread 1: Integer Integer POSSIBLE? Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Thread 2: Floating Point Floating Point Dr. Zaman 10

11 Performance Threads 1 and 3: Integer s Thread 1: Integer Thread 3: Integer POSSIBLE? Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Floating Point Dr. Zaman 11

12 Performance Threads 1 and 3: Integer s (Multicore) Thread 1: Integer Integer Instruction Fetch Instruction Decode Operand(s) Fetch Arithmetic Logic Unit Result Write Back Core 1 Floating Point POSSIBLE? Thread 3: Integer Integer Instruction Fetch Core 2 Instruction Decode Operand(s) Fetch Dr. Zaman 12 Arithmetic Logic Unit Floating Point Result Write Back

13 Performance Threads 1, 2, 3, and 4: INT & FP s (Multicore) Instruction Fetch Core 1 Instruction Fetch Core 2 Thread 1: Integer Instruction Decode Thread 2: Floating Point Thread 3: Integer Instruction Decode Operand(s) Fetch Operand(s) Fetch Thread 4: Floating Point Dr. Zaman 13 Integer Arithmetic Logic Unit Floating Point Integer Arithmetic Logic Unit Floating Point Result Write Back Result Write Back POSSIBLE?

14 Performance Simultaneous Multithreading (SMT) Thread A running program (or code segment) is a process Process processes / threads Multithreading (IP4 Hyper-threading) Multiple threads running in a single-processor (time sharing) Simultaneous Multithreading (SMT) Multiple threads running in a single-processor at the same time Generating/Managing Multiple Threads OpenMP, Open MPI C Dr. Zaman 14

15 Performance Multicore Architecture Single-Core Processors Multiprocessor and Multicomputer Systems Multiple processors: shared/common memory, local memory Multiple processors: own private memory Multicore Processors Multiple cores on a single chip Working together, sharing resources Multicore Programming Language supports OpenMP, Open MPI C Dr. Zaman 15

16 Performance Parallel/Concurrent Computing Parallel Processing It is not fun! Paying the lunch bill together Friend Before Eating Total Bill Return Tip After Paying A $10 $1 B $10 $25 $5 $2 $1 C $10 $1 Total $30 $2 Total Spent $9 $9 $9 $27 Started with $30; spent $29 ($27 + $2) Where did $1 go? (Juggling!) Dr. Zaman 16

17 Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 2 x 2 Matrix 8 (i.e., 2 * 2^2) multiplications 4 (i.e., 1 * 2^2) additions Dr. Zaman 17

18 Ultimate Performance Multicore with SMT is it enough? Example: Matrix Multiplication [C] = [A] [B] 3 x 3 Matrix; how many multiplications and additions? 27 (i.e., 3 * 3^2 i.e., 3^3) multiplications 18 (i.e., 2 * 3^2 i.e., (3 1) * 3^2) additions Dr. Zaman 18

19 Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] = 4 x 4 Matrix ; how many multiplications and additions? 64 (i.e., 4^3) multiplications N^3 multiplications 48 (i.e., 3 * 4^2) additions (N 1)N^2 additions Dr. Zaman 19

20 Ultimate Performance Algorithm Design Techniques Example: Matrix Multiplication 4 x 4 Matrix 64 (i.e., 4^3) multiplications 48 (i.e., 3 * 4^2) additions A A 1,1 A 1,2 B 2 x 2 Matrix 8 (i.e., 2^3) multiplications 4 (i.e., 1 * 2^2) additions Are we reducing *s/+s? What is the message? Dr. Zaman 20

21 Algorithm Design Techniques Example: Matrix Multiplication [C] = [A] [B] Ultimate Performance Say, we have unlimited 2 x 2 Matrix solvers with 8 MULT Then it takes only 2 * 8 MULT time unit Do we have unlimited solvers/cores? Dr. Zaman 21

22 Ultimate Performance GPGPU/CUDA Technology GPGPU General-Purpose computing on Graphics Processing Units (GPGPU, GPGP or less often GP²U). More for scientific usages. GPU Graphics Processing Units. Mainly for multimedia usages. CUDA CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. Provides GPGPU programing interface. Dr. Zaman 22

23 Ultimate Performance GPGPU/CUDA Technology (Looking back: PCI) PCI (Peripheral Component Interconnect) CPU PCI Peripherals CPU PCI-E Peripherals PCI Express (Peripheral Component Interconnect Express) One CPU, Multiple GPUs (Juggling) GPU CPU GPU GPU Dr. Zaman 23

24 Ultimate Performance GPGPU/CUDA Technology GPU (the chip itself) consists of group of Streaming Multiprocessors (SM) Inside each SM: 32 cores (sharing the same instruction) 64KB shared memory (shared among the 32 cores) 32K 32bit registers 2 warp schedulers (to schedule instructions) 4 special function units (Juggling) Dr. Zaman 24

25 Ultimate Performance GPGPU/CUDA Technology GPU (the chip itself) consists of group of Streaming Multiprocessors (SM) Inside each SM: 32 cores (sharing the same instruction) 64KB shared memory (shared among the 32 cores) 32K 32bit registers 2 warp schedulers (to schedule instructions) 4 special function units (Juggling) Dr. Zaman 25

26 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 1) CPU allocates and copies data to GPU On CUDA API: cudamalloc() cudamemcpy() Dr. Zaman 26

27 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 2) CPU Sends function parameters and instructions to GPU CUDA API: myfunc<<<blocks, Threads>>>(parameters) Dr. Zaman 27

28 Ultimate Performance GPGPU/CUDA Technology The host (CPU) executes a kernel in GPU in 4 steps (Step 3) GPU executes instruction as scheduled in warps (Step 4) Results will need to be copied back to Host memory (RAM) using cudamemcpy() Dr. Zaman 28

29 Ultimate Performance GPGPU/CUDA Technology CUDA Threads are grouped into blocks. This is to optimize the use of memory. Instruction sent by host to GPU is called a Kernel. GPU sees a kernel as a grid of blocks of threads. Dr. Zaman 29

30 Ultimate Performance GPGPU/CUDA Technology Each CUDA thread will execute in one core. Depending on memory requirements of a kernel, multiple block may execute on each SM. Each kernel can only be executed by one device (unless programmer s intervention). Multiple kernels may be executed at one time. Dr. Zaman 30

31 Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication Matrices Systems Dr. Zaman 31

32 Ultimate Performance Case Study 1 (data independent computation without GPU/CUDA) Matrix Multiplication Execution Time Power Consumption Dr. Zaman 32

33 Ultimate Performance Case Study 2 (data dependent computation without GPU/CUDA) Heat Transfer on 2D Surface Execution Time Power Consumption Dr. Zaman 33

34 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Lightning Strike Protection (LSP) Dr. Zaman 34

35 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation Many aerospace companies have incorporated fiber-reinforced composite materials into the fuselage, either partially or wholly because of the high strength-to-weight ratio, stiffness, and larger scale manufacturing abilities at any shape. However, the lack of lightning strike protection (LSP) for the composite materials limits their use in many applications. We propose a fast and effective simulation model using NVIDIA general purpose graphics processing unit (GPGPU) and compute unified device architecture (CUDA) technology which is targeted to LSP analysis on composite aircrafts. Dr. Zaman 35

36 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation In many cases like lightning strikes on a composite material, when the charge distribution is not known, the Poisson's Equation can be used to solve any electrostatic problem. Using the Laplacian operator on the electric potential function over a region of the space where the charge density is not zero, the Poisson's Equation is: Dr. Zaman 36

37 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation If the charge density is zero all over the region, the Poison's Equation becomes Laplace's Equation: A. Asaduzzaman, C. Yip, S. Kumar, and R. Asmatulu, Fast, Effective, and Adaptable Computer Modeling and Simulation of Lightning Strike Protection on Composite Materials, under preparation, IEEE SoutheastCon conference 2013, Jacksonville, Florida, April 4-7, Dr. Zaman 37

38 Ultimate Performance Case Study 3 (data dependent computation with GPU/CUDA) Fast Effective LSP Simulation Simulation CPU Only CPU/GPU w/o shared memory CPU/GPU with shared memory Dr. Zaman 38

39 Ultimate Performance Case Study 4 (data independent computation with GPU/CUDA) Quantum Computing On going Expecting collaboration with Dr. Kumar, EECS, WSU Other Areas Eco-Biological studies Medical studies More Dr. Zaman 39

40 Outline Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Introduction Juggling (Ultimate) Performance Multicore Architectures, Simultaneous Multithreading (SMT) Multicore with SMT provides the ultimate performance T/F? CAPPLab Researchers, Resources Research Activities (Multicore with SMT plus GPGPU/CUDA Technology) Discussion Contact Information QUESTIONS? Any time! Dr. Zaman 40

41 WSU CAPPLab CAPPLab Computer Architecture & Parallel Programming Laboratory (CAPPLab) Physical location: 245 Jabara Hall URL: Tel: WSU-3927 Key Objectives Lead research in advanced-level computer architecture, highperformance computing, embedded systems, and related fields. Educate advanced-level computer architecture and parallel programming. Dr. Zaman 41

42 WSU CAPPLab Researchers Faculty Members Dr. Abu Asaduzzaman, Asst. Prof., EECS, WSU Students Chok M. Yip, MS Student, EECS Dept. Nasrin Sultana, MS Student, EECS Dept. Zachary A. Vickers, BS Student, EECS Dept. Hin Yun Lee, MS in CS, EECS Dept. Others Dr. Ramazan Asmatulu, Asso. Prof., ME, WSU Dr. Preethika Kumar, Asst. Prof., EECS, WSU Dr. Zaman 42

43 WSU CAPPLab Resources Hardware 1: CUDA Server CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Telsa C2075, 14x 32 cores, 6GB GDDR5 memory 2: CUDA PC CPU: Xeon E5506, 3: Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3) via remote access to WSU (HiPeCC) 2 CUDA enabled Windows Workstations/PCs 1 CUDA enabled Laptops More Software As needed Dr. Zaman 43

44 WSU CAPPLab Past/Current Activities WSU became CUDA Teaching Center for Support from NVIDIA Teaching parallel programming Workshop GPGPU/CUDA/C Summer 2012 (10 participants) GPGPU/CUDA/C Summer 2013 Collaborative Research Dr. Ramazan Asmatulu Dr. Preethika Kumar More Dr. Zaman 44

45 WSU CAPPLab Past/Current/Future Activities Research Funding M2SYS-WSU Biometric Cloud Computing Research Project Teaching (Hardware/Financial) supports from NVIDIA Research Funding MURPA, pending, ORA, WSU NSF TUES Type-1, pending, NSF Preparing for NSF, and other external agencies Dr. Zaman 45

46 Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Thank You! Contact: Abu Asaduzzaman Phone:

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

Program Grid and HPC5+ workshop

Program Grid and HPC5+ workshop Program Grid and HPC5+ workshop 24-30, Bahman 1391 Tuesday Wednesday 9.00-9.45 9.45-10.30 Break 11.00-11.45 11.45-12.30 Lunch 14.00-17.00 Workshop Rouhani Karimi MosalmanTabar Karimi G+MMT+K Opening IPM_Grid

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Introduction to GPU Computing

Introduction to GPU Computing Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

CS 394 Introduction to Computer Architecture Spring 2012

CS 394 Introduction to Computer Architecture Spring 2012 CS 394 Introduction to Computer Architecture Spring 2012 Class Room/Hours: NA (Online course) Lab Room/Hours: NA Instructor: Abu Asaduzzaman (Dr. Zaman) Office Room: 253 Jabara Hall E-mail: Abu.Asaduzzaman@wichita.edu

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

CS2101a Foundations of Programming for High Performance Computing

CS2101a Foundations of Programming for High Performance Computing CS2101a Foundations of Programming for High Performance Computing Marc Moreno Maza & Ning Xie University of Western Ontario, London, Ontario (Canada) CS2101 Plan 1 Course Overview 2 Hardware Acceleration

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Clustering Billions of Data Points Using GPUs

Clustering Billions of Data Points Using GPUs Clustering Billions of Data Points Using GPUs Ren Wu ren.wu@hp.com Bin Zhang bin.zhang2@hp.com Meichun Hsu meichun.hsu@hp.com ABSTRACT In this paper, we report our research on using GPUs to accelerate

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

GPU Architecture. Michael Doggett ATI

GPU Architecture. Michael Doggett ATI GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

ADVANCED COMPUTER ARCHITECTURE

ADVANCED COMPUTER ARCHITECTURE ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: marco.ferretti@unipv.it Web: www.unipv.it/mferretti, eecs.unipv.it 1 Course syllabus and motivations This course covers the

More information

Enabling Technologies for Distributed Computing

Enabling Technologies for Distributed Computing Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,

More information

Let s put together a Manual Processor

Let s put together a Manual Processor Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce

More information

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook

More information

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

Overview. CPU Manufacturers. Current Intel and AMD Offerings

Overview. CPU Manufacturers. Current Intel and AMD Offerings Central Processor Units (CPUs) Overview... 1 CPU Manufacturers... 1 Current Intel and AMD Offerings... 1 Evolution of Intel Processors... 3 S-Spec Code... 5 Basic Components of a CPU... 6 The CPU Die and

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 CPU Session 1 Praktikum Parallele Rechnerarchtitekturen Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 Overview Types of Parallelism in Modern Multi-Core CPUs o Multicore

More information

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit Unit A451: Computer systems and programming Section 2: Computing Hardware 1/5: Central Processing Unit Section Objectives Candidates should be able to: (a) State the purpose of the CPU (b) Understand the

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set

More information

Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory

More information

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Course Development of Programming for General-Purpose Multicore Processors

Course Development of Programming for General-Purpose Multicore Processors Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

1. If we need to use each thread to calculate one output element of a vector addition, what would

1. If we need to use each thread to calculate one output element of a vector addition, what would Quiz questions Lecture 2: 1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to data index: (A) i=threadidx.x

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

An examination of the dual-core capability of the new HP xw4300 Workstation

An examination of the dual-core capability of the new HP xw4300 Workstation An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management

Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management Performance Measurement of a High-Performance Computing System Utilized for Electronic Medical Record Management 1 Kiran George, 2 Chien-In Henry Chen 1,Corresponding Author Computer Engineering Program,

More information

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Experiences on using GPU accelerators for data analysis in ROOT/RooFit Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,

More information

CUDA Basics. Murphy Stein New York University

CUDA Basics. Murphy Stein New York University CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Lecture 1: an introduction to CUDA

Lecture 1: an introduction to CUDA Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming

More information

On-Demand Supercomputing Multiplies the Possibilities

On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of Wolfram Research, Inc. On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

10- High Performance Compu5ng

10- High Performance Compu5ng 10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information