Multi Grid for Multi Core
1 Multi Grid for Multi Core. Harald Köstler, Daniel Ritter, Markus Stürmer and U. Rüde (LSS Erlangen), in collaboration with many more. Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de. Seattle, February 2010, SIAM Parallel Processing.
2 Overview: Motivation; Architectures: (clusters of) multi-core CPUs (standard architecture, IBM Cell, GPU); Algorithms: Multigrid; Performance Engineering (porting multigrid to GPUs, local memory blocking techniques for multi-core CPUs); Conclusions.
3 Evolution of processors. [Table comparing Std.-CPU, CBEA and GPU; row groups: instruction, data, thread, transfer. Improvements listed: pipelining, superscalar execution, out-of-order, wider buses, SIMD, multithreading, multiprocessing, caches, hardware prefetcher, local storage, resource virtualization. The per-architecture marks did not survive transcription.]
4 Cell/B.E. and PowerXCell 8i
5 Nvidia GeForce GTX 295. Costs: 450; Interface: PCI-E 2.0 x16; Shader Clock: 1242 MHz; Memory Clock: 999 MHz; Memory Bandwidth: 2x112 GB/s; FLOPS: 2x894 GFLOPS; Max Power Draw: 289 W; Framebuffer: 2x896 MB; Memory Bus: 2x448 bit; Shader Processors: 2x240.
6 ATI Radeon HD 4870. Costs: 150; Interface: PCI-E 2.0 x16; Shader Clock: 750 MHz; Memory Clock: 900 MHz; Memory Bandwidth: 115 GB/s; FLOPS: 1200 GFLOPS; Max Power Draw: 160 W; Framebuffer: 1024 MB; Memory Bus: 256 bit; Shader Processors: 800.
7 Performance of Multigrid for image processing on GPUs
8 The Multigrid Idea. Iterative solvers only remove local components of the error fast. Instead: smooth (remove local features of the error); restrict (coarsen the error); compute a correction of the error on the coarse grid; prolongate (interpolate and apply the correction); smooth again.
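The steps listed on this slide can be sketched as a recursive V-cycle. The following is a minimal illustration, not the authors' GPU code: it assumes the 1D model problem -u'' = f with zero boundary values, a grid of 2^k + 1 points, weighted Jacobi as smoother, full weighting for restriction and linear interpolation for prolongation. All function names are hypothetical.

```python
import math

def smooth(u, f, h, sweeps):
    """Weighted Jacobi (weight 2/3) for -u'' = f; end values stay fixed."""
    w = 2.0 / 3.0
    for _ in range(sweeps):
        old = u[:]
        for i in range(1, len(u) - 1):
            jac = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = (1.0 - w) * old[i] + w * jac
    return u

def residual(u, f, h):
    """r = f - A u for the 3-point Laplacian; zero at the boundary."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def restrict(r):
    """Full weighting onto every second grid point."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for j in range(1, nc - 1):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc

def prolong(ec, n_fine):
    """Linear interpolation of the coarse correction to the fine grid."""
    e = [0.0] * n_fine
    for j in range(len(ec)):
        e[2 * j] = ec[j]
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])
    return e

def v_cycle(u, f, h, pre=2, post=2):
    """One V(pre, post) cycle; modifies and returns u."""
    if len(u) <= 3:
        # coarsest grid: a single interior unknown, solved exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    smooth(u, f, h, pre)                                   # smooth
    rc = restrict(residual(u, f, h))                       # restrict
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, pre, post)  # coarse correction
    e = prolong(ec, len(u))                                # prolongate
    for i in range(1, len(u) - 1):
        u[i] += e[i]
    smooth(u, f, h, post)                                  # smooth again
    return u

def demo(n=65, cycles=10):
    """Max error against the exact solution sin(pi x) after some V(2,2) cycles."""
    h = 1.0 / (n - 1)
    f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n)]
    u = [0.0] * n
    for _ in range(cycles):
        v_cycle(u, f, h)
    return max(abs(u[i] - math.sin(math.pi * i * h)) for i in range(n))
```

Calling `demo()` runs ten V(2,2) cycles on 65 points; the remaining error is then dominated by the discretization, not by the iteration.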
9 Full Multigrid Cycle. [Figure labels: smoothing; V-cycle; interpolation of solution u; exact solution; restriction of residual r = f - Au; interpolation of error and correction of solution u.] Computer Science X - System Simulation Group, Harald Köstler (harald.koestler@informatik.uni-erlangen.de)
10 Multigrid on Nvidia GeForce GTX 295. [Plot: runtime of a V(2,2) cycle in ms versus image size.]
11 Multigrid on ATI Radeon HD 4870. [Plot: runtime of a V(2,2) cycle in ms versus image size.]
12 Runtime Comparison. Nvidia GeForce GTX 295 (half of the GPU): memory bandwidth 112 GB/s, runtime 41.5 ms (4096 x 4096). ATI Radeon HD 4870: memory bandwidth 115 GB/s, runtime 40.4 ms (4096 x 4096). Both cards show very similar performance.
13 Red-black Splitting: store red and black values in two different arrays; this doubles the performance.
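The storage idea on this slide can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the GPU kernel from the talk: it uses the 1D problem -u'' = f with an odd number of grid points, calls even-indexed points "red" and odd-indexed points "black", and compares a red-black Gauss-Seidel sweep on one interleaved array with the same sweep on two separate arrays. In the split layout every update reads and writes contiguous, unit-stride data, which is what makes the layout attractive on hardware with wide memory transactions. The function names are hypothetical.

```python
def rbgs_interleaved(u, f, h, sweeps):
    """Red-black Gauss-Seidel on a single array (stride-2 accesses)."""
    u = u[:]
    n = len(u)
    for _ in range(sweeps):
        for start in (2, 1):  # red (even interior points) first, then black (odd)
            for i in range(start, n - 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

def rbgs_split(u, f, h, sweeps):
    """Same sweeps, with red and black values held in two separate arrays."""
    red, black = u[0::2], u[1::2]        # red[k] = u[2k], black[k] = u[2k+1]
    f_red, f_black = f[0::2], f[1::2]
    for _ in range(sweeps):
        # red update reads only black values (unit stride in both arrays);
        # red[0] and red[-1] are the fixed boundary values
        for k in range(1, len(red) - 1):
            red[k] = 0.5 * (black[k - 1] + black[k] + h * h * f_red[k])
        # black update reads only the freshly updated red values
        for k in range(len(black)):
            black[k] = 0.5 * (red[k] + red[k + 1] + h * h * f_black[k])
    out = [0.0] * len(u)
    out[0::2], out[1::2] = red, black    # merge back for comparison
    return out
```

Both variants perform the same arithmetic in the same order, so they return identical values; only the memory layout differs.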
14 Multigrid on GTX 295 with red-black splitting. [Plot: runtime of a V(2,2) cycle in ms versus image size.]
15 Memory Bandwidth. [Plot: percent of memory bandwidth (0% to 100%) versus image size, given in percent of the maximum measured (rounded) streaming bandwidth (100 GB/s).]
16 Runtime Distribution for GPU Kernels. RBGS 52%; memcopy 29%; Residual_Restrict 7%; Interpolate_corr 6%; 2ndDerivative 3%; Gradients 3%.
17 Frames per second for Image Stitching. [Plot: fps (stitching GTX 295), fps (solver GTX 295) and fps (CPU) versus image size.] CPU: Intel Core2 Quad Q9550 @ 2.83 GHz with OpenMP (4 cores).
18 Leaping the memory wall: cache blocking techniques
19 Leaping the memory wall: local storage cache blocking techniques
20-25 Cache Blocking: leaping over the memory wall. Idea: change the order of operations to increase cache locality. Matrix-matrix multiplication: add up multiplications of sub-matrices, each of which can be performed completely in cache. Stencil-based kernels, example 5-point stencil: without blocking (slides 20-21), spatial blocking (slides 22-23), temporal blocking (slides 24-25). [Figures not reproduced.]
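Both reorderings named above can be shown in a few lines. This is a plain illustrative sketch: the tile size `bs` and the function names are assumptions, and pure Python will of course not show the cache effect itself. It only demonstrates that visiting the data tile by tile leaves the result unchanged while confining each inner loop nest to a small working set.

```python
def matmul_blocked(a, b, bs):
    """C = A * B computed as a sum of products of bs x bs sub-matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        for k0 in range(0, m, bs):
            for j0 in range(0, p, bs):
                # one sub-matrix product; touches at most 3 * bs * bs values
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + bs, p)):
                            c[i][j] += aik * b[k][j]
    return c

def stencil_sweep_blocked(src, bs):
    """One 5-point averaging sweep, traversed in bs x bs tiles (spatial blocking)."""
    ny, nx = len(src), len(src[0])
    dst = [row[:] for row in src]          # boundary values are copied unchanged
    for y0 in range(1, ny - 1, bs):
        for x0 in range(1, nx - 1, bs):
            for y in range(y0, min(y0 + bs, ny - 1)):
                for x in range(x0, min(x0 + bs, nx - 1)):
                    dst[y][x] = 0.25 * (src[y - 1][x] + src[y + 1][x]
                                        + src[y][x - 1] + src[y][x + 1])
    return dst
```

Choosing `bs` at least as large as the grid reduces both functions to the unblocked loop order, which gives a convenient reference for checking a blocked run.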
26 Temporal LS-blocking of 3D Stencil Code? Options for temporal cache blocking in parallel: synchronize data accesses at boundaries; compute required intermediate results locally; waste on block boundaries (high surface/volume ratio for small LS). Alignment and size of DMA and SIMD constrain the block size. Low potential (estimate: max. 150% for 2× temporal blocking). »not even Markus would like to program that!«
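The option of computing intermediate results locally, and the waste it causes at block boundaries, can be made concrete with a small 1D sketch. This is an illustration under stated assumptions, not the 3D code discussed on the slide: a 3-point averaging stencil stands in for the real kernel, two sweeps are fused per block, and each block recomputes one halo point per side for the intermediate step. The names `sweep` and `two_sweeps_blocked` are hypothetical.

```python
def sweep(u):
    """One 3-point averaging sweep; the two end values stay fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = 0.5 * (u[i - 1] + u[i + 1])
    return v

def two_sweeps_blocked(u, bs):
    """Two fused sweeps per block of bs points (2x temporal blocking).

    Returns the result and the number of intermediate values that were
    computed redundantly in block halos.
    """
    n = len(u)
    out = u[:]
    redundant = 0
    for a in range(1, n - 1, bs):
        b = min(a + bs, n - 1)               # block owns points a .. b-1
        lo, hi = max(a - 2, 0), min(b + 2, n)
        buf = u[lo:hi]                       # local copy including a 2-point halo
        tmp = buf[:]
        # first sweep: owned points plus one halo point on each side
        for i in range(max(a - 1, 1), min(b + 1, n - 1)):
            tmp[i - lo] = 0.5 * (buf[i - 1 - lo] + buf[i + 1 - lo])
            if i < a or i >= b:
                redundant += 1               # also computed by the neighbouring block
        # second sweep: owned points only
        for i in range(a, b):
            out[i] = 0.5 * (tmp[i - 1 - lo] + tmp[i + 1 - lo])
    return out, redundant
```

The blocked version reproduces `sweep(sweep(u))` exactly, and the redundant count grows as `bs` shrinks, which is the surface-to-volume effect the slide warns about for small local stores.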
27 An approach for local storage blocking
28 Buffered blocking. [Figure labels: buffer structures (local storage); source grid (memory); target grid (memory).]
29 Buffered blocking. [Figure labels: can hold one tile; data; 1 stripe per SPE at a time; tile dependencies.]
30-32 Buffered blocking. [Animation steps; figure content not recoverable.]
33 Buffered blocking: can also be used with caches; a supporting (hybrid) framework is feasible.
34 Framework architecture. Framework: thread management (creation, synchronization, affinity); data structures (alignment, padding (optional)); control (traversal of grid, distribution of work, calling of library kernels). Application code: general configuration (threads and their properties, type of per-thread buffers); setup of shared data; transfer of control (grid size, constraints, number of sweeps); kernels: storage2buffer( tiledesc&, buf& ), compute( bufferstack& ), buffer2storage( tiledesc&, buf& ).
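The division of labour between framework and application kernels might look roughly like the following single-threaded sketch. Only the three kernel names come from the slide; their Python signatures, the tile description as a pair of row indices, the 5-point averaging kernel, and the `traverse` driver are assumptions made for illustration. Thread management, alignment and buffer stacks are left out.

```python
def storage2buffer(src, tile):
    """Application kernel: copy one stripe of rows plus one halo row
    above and below from the source grid into a local buffer."""
    y0, y1 = tile
    return [row[:] for row in src[y0 - 1:y1 + 1]]

def compute(buf):
    """Application kernel: 5-point averaging on the interior rows of the buffer."""
    out = []
    for y in range(1, len(buf) - 1):
        row = buf[y][:]                      # keeps the left/right boundary values
        for x in range(1, len(row) - 1):
            row[x] = 0.25 * (buf[y - 1][x] + buf[y + 1][x]
                             + buf[y][x - 1] + buf[y][x + 1])
        out.append(row)
    return out

def buffer2storage(dst, tile, rows):
    """Application kernel: write the computed rows back to the target grid."""
    y0, y1 = tile
    dst[y0:y1] = rows

def traverse(src, tile_rows):
    """Framework side: walk over the grid in stripes and call the kernels."""
    dst = [row[:] for row in src]
    ny = len(src)
    for y0 in range(1, ny - 1, tile_rows):
        tile = (y0, min(y0 + tile_rows, ny - 1))
        buf = storage2buffer(src, tile)
        buffer2storage(dst, tile, compute(buf))
    return dst
```

Because source and target grids are separate, the stripes are independent of each other within one sweep, so a real framework could hand different stripes to different threads or SPEs.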
35 Buffered blocking in action: Multigrid Method for Complex Diffusion
36 Image smoothing by complex diffusion: smoothing an image by solving a nonlinear, complex diffusion equation; the imaginary part works as an edge detector; full approximation scheme multigrid solver; simple variant of the Schrödinger equation.
37 FAS for Complex Diffusion: ~400/450 flop/unknown on each grid level. Compute RHS (except finest level); two ω-Jacobi iterations; compute residual; restrict current solution and residual; two ω-Jacobi iterations; interpolate and apply correction.
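The per-unknown cost quoted on this slide allows a rough estimate of the work in one cycle. The sketch below is back-of-the-envelope arithmetic under assumptions that the slide does not state: a square image, coarsening by a factor of two per dimension down to a 2 x 2 grid, and a flat 400 flop per unknown on every level (the lower of the two quoted figures). The function name is hypothetical.

```python
def flops_per_cycle(n, flop_per_unknown=400, coarsest=2):
    """Total flop for one cycle on an n x n image, summed over all levels."""
    total = 0
    while n >= coarsest:
        total += flop_per_unknown * n * n   # work on the current level
        n //= 2                             # next coarser level has n/2 x n/2 unknowns
    return total

# 4096 x 4096 image: the finest level dominates and the coarser levels
# add roughly one third on top (geometric series 1 + 1/4 + 1/16 + ...)
work = flops_per_cycle(4096)
```

Under these assumptions `work` comes to about 8.9 GFLOP per cycle, of which about 6.7 GFLOP is spent on the finest level.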
38 Framework performance results: V(2,2)-cycle for a 4096×4096 image. [Bar chart: time in ms (axis 0 to 400) for 2 Core2*, Core i7**, QS22 (16 SPEs) and GTX 295 (half); labels: without blocking; more than 36 GB/s; 110 GFLOPS.] Straightforward C++ implementation with OpenMP on Core architectures: > 1 s. * Core2 Xeon 2.8 GHz; ** Core 2.93 GHz.
39 Conclusions and Outlook
40 What else do we do? Parallel Multigrid Algorithms on 10,000 cores and beyond: talk by UR this afternoon (5:00 pm) in MS 16, Challenges in Parallel Adaptive Mesh Refinement. Parallel Rigid Body Dynamics: MS 54, Friday 1:20-3:30 pm; talk by Klaus Iglberger on Friday 1:20 pm; poster by Tobias Preclik, Parallel Rigid Body Dynamics. Graduate Education in CS&E: talk by UR this afternoon (2:50 pm) in MS 11, Graduate Education for the Parallel Revolution. Parallel Lattice Boltzmann Methods for Complex Flows: no talk. Performance Analysis: talk by Georg Hager (Erlangen Computing Center) in MS 45, Friday 9:50-11:50 am, Analysis of Hybrid Applications on Modern Architectures.
41 Granular Flows with Non-Spherical Particles and Frictional Elastic Collisions 64 Processes, particles, each composed of 2-5 overlapping spheres, approx. 13 hours runtime D.M. Kaufman, T. Edmunds, and D.K. Pai: Fast frictional dynamics for rigid bodies. ACM Transactions on Graphics 24: ,
42 Thanks for your attention! Questions? Slides, reports, thesis, animations available for download at: www10.informatik.uni-erlangen.de
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationSequential Performance Analysis with Callgrind and KCachegrind
Sequential Performance Analysis with Callgrind and KCachegrind 2 nd Parallel Tools Workshop, HLRS, Stuttgart, July 7/8, 2008 Josef Weidendorfer Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut
More informationA Fast Double Precision CFD Code using CUDA
A Fast Double Precision CFD Code using CUDA Jonathan M. Cohen *, M. Jeroen Molemaker** *NVIDIA Corporation, Santa Clara, CA 95050, USA (e-mail: jocohen@nvidia.com) **IGPP UCLA, Los Angeles, CA 90095, USA
More informationKeys to node-level performance analysis and threading in HPC applications
Keys to node-level performance analysis and threading in HPC applications Thomas GUILLET (Intel; Exascale Computing Research) IFERC seminar, 18 March 2015 Legal Disclaimer & Optimization Notice INFORMATION
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationWhy You Need the EVGA e-geforce 6800 GS
Why You Need the EVGA e-geforce 6800 GS GeForce 6800 GS Profile NVIDIA s announcement of a new GPU product hailing from the now legendary GeForce 6 series adds new fire to the lineup in the form of the
More informationHigh Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
More informationSUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE
SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks
More informationYALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
More informationMixing Multi-Core CPUs and GPUs for Scientific Simulation Software
SUBMITTED TO IEEE TRANS. ON PARALLEL AND DISTRIBUTED SYSTEMS 1 Mixing Multi-Core CPUs and GPUs for Scientific Simulation Software K.A. Hawick, Member, IEEE, A. Leist, and D.P. Playne Abstract Recent technological
More informationRay Tracing on Graphics Hardware
Ray Tracing on Graphics Hardware Toshiya Hachisuka University of California, San Diego Abstract Ray tracing is one of the important elements in photo-realistic image synthesis. Since ray tracing is computationally
More informationEVALUATION OF MULTI-CORE ARCHITECTURES FOR IMAGE PROCESSING ALGORITHMS
EVALUATION OF MULTI-CORE ARCHITECTURES FOR IMAGE PROCESSING ALGORITHMS A Thesis Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Master of
More informationCosmological Simulations on Large, Heterogeneous Supercomputers
Cosmological Simulations on Large, Heterogeneous Supercomputers Adrian Pope (LANL) Cosmology on the Beach January 10, 2011 Slide 1 People LANL: Jim Ahrens, Lee Ankeny, Suman Bhattacharya, David Daniel,
More information