High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
1 High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem. Kamil Rocki, PhD, Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo
2 Why TSP? It's hard (NP-hard, to be exact), especially to parallelize, but easy to understand: a basic problem in CS. TSPLIB, a set of standard problems, allows comparisons of algorithms, and it's simple to adjust the problem size. TSP is a platform for the study of general methods that can be applied to a wide range of discrete optimization problems.
3 Not only a salesman's headache: computer wiring, vehicle routing, crystallography, robot control. pla85900.tsp: it took 15 years to solve this problem (solved in 2006).
5 Local/Global Optimization. Usually the search space is huge, with many valleys and local minima. Local search might terminate in one of the local minima; global search should find the best solution.
6 Iterated Local Search - Global Optimization (approximate algorithm). Local search is followed by some sort of tour randomization (perturbation), and the process is repeated while there is time left. Most of the time is spent in the local search part. (Figure: cost over the solution space S, with local optima s* reached from perturbed starting points; benchmark instances kroE100, ch130, ch150, kroA200, ts225, pr299, pr439, rat783, vm1084, pr2392.) Thomas Stützle, Iterated Local Search & Variable Neighborhood Search, MN Summerschool, Tenerife, 2003, p. 9.
7 Iterated Local Search - Global Optimization (approximate algorithm)

    1: procedure IteratedLocalSearch
    2:     s0 := GenerateInitialSolution()
    3:     s* := 2optLocalSearch(s0)            -- GPU
    4:     while (termination condition not met)
    5:         s' := Perturbation(s*)
    6:         s*' := 2optLocalSearch(s')       -- GPU
    7:         s* := AcceptanceCriterion(s*, s*')
    8:     end while
    9: end procedure
8 TSP - Big valley theory (why does ILS work?). Local minima form clusters: the better a solution, the closer it tends to be to the global optimum, and the global minimum sits somewhere in the middle of the cluster. (Plots: (a) cost vs. distance to the optimal tour; (b) cost vs. mean distance to other solutions.) ILS worsens the tour only slightly, so local search is repeated close to the previous minimum. S. Baluja, A. G. Barto, K. D. Boese et al., Statistical Machine Learning for Large-Scale Optimization, 2000.
9 2-opt local optimization. Exchange a pair of edges to obtain a better solution if the following condition is true: distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f). (Diagram: tours over cities A-B-C-D-E-F-G before and after the exchange.)
10 2-opt local optimization. To find such a pair, (n-2)(n-3)/2 checks need to be performed. There are pruning techniques in more complex algorithms; here all possible pairs are analyzed. A 10,000-city problem gives ~49.9 million pairs; a 100,000-city problem, ~5 billion pairs.

    for (int i = 1; i < points - 2; i++)
        for (int j = i + 1; j < points - 1; j++)
            if (distance(i, i-1) + distance(j+1, j) >
                distance(i, j+1) + distance(i-1, j))
                update best i and j;
    remove edges (best_i, best_i-1) and (best_j+1, best_j)
    add edges (best_i, best_j+1) and (best_i-1, best_j)
11 Obtaining a distance. Naive approach: calculate it each time it's needed. Smart way: reuse the results via a look-up table (LUT). For a 100-city problem the LUT is approximately 5 times faster (CPU). (Table: TSPLIB problems from kroE100 to fnl4461 with the number of cities (points), the memory needed for the LUT in MB, which grows as n^2, and the memory needed for the coordinates in kB, which grows as n.)
12 Obtaining a distance (continued). For larger problems, however, the LUT can be slower: it no longer fits in the cache.
13 Obtaining a distance (continued). With multiple cores, the LUT becomes slower still: the cores contend for cache and memory bandwidth.
14 LUT (look-up table) vs. recalculation. (Plot: time vs. number of cores - the "clever" LUT algorithm degrades as cores are added, while the "not-so-clever" recalculation algorithm keeps scaling.)
15 A change in algorithm design - sequential: memory is free, so reuse computed results; use sophisticated algorithms and prune the search space.
16 A change in algorithm design - parallel: memory is limited (in size and throughput), while computing is (almost) free. Sometimes brute-force search is faster, especially on GPUs, because it avoids divergence: processing 10 or 100 elements in parallel takes the same amount of time, at the price of empty FLOPs.
17 GPU implementation. The GPU uses this simple function for EUC_2D problems (registers + shared memory):

    __device__ int calculatedistance2d(unsigned int i, unsigned int j,
                                       city_coords* coords)
    {
        register float dx, dy;
        dx = coords[i].x - coords[j].x;
        dy = coords[i].y - coords[j].y;
        return (int)(sqrtf(dx * dx + dy * dy) + 0.5f); // 6 FLOPs + sqrt
    }
18 GPU implementation. Since dist(i,j) = dist(j,i) and dist(i,i) = 0, only pairs with i < j need checking. The pairs form a triangular matrix of city pairs, each pair corresponding to one GPU job: (0,1) is job 0; (0,2) and (1,2) are jobs 1 and 2; (0,3), (1,3), (2,3) are jobs 3-5; and so on. (Example for 32 threads: the matrix is swept in successive iterations, 32 jobs at a time.) Each thread has to check: if (distance(i, i-1) + distance(j+1, j) > distance(i, j+1) + distance(i-1, j)) update best i and j;
19 Optimizations. Preorder the coordinates in the route's order on the CPU. Before: a route array plus an array of coordinates; the coordinates of the point at position n are fetched as x = coordinates[route[n]].x, y = coordinates[route[n]].y, for a total size of n * sizeof(route data type) + 2 * n * sizeof(coordinate data type). After: route + coordinates are merged into one array of ordered coordinates; x = ordered_coordinates[n].x, y = ordered_coordinates[n].y, for a total size of 2 * n * sizeof(coordinate data type). Benefits: the reads from the global memory are no longer scattered (much faster), 8 B are needed instead of 10 B per city, there are no unnecessary address calculations, and the problem can be split if it's too big for the shared memory's size.
20 GPU implementation. An overview of the algorithm: 1. Copy the ordered coordinates to the GPU global memory (CPU). 2. Execute the kernel. 3. Copy the ordered coordinates to the shared memory. 4. Calculate the swap effect of one pair. 5. Find the best value and store it in the global memory. 6. Read the result (CPU).
21 What if the coordinates' size exceeds the size of the shared memory? Let's assume a big problem: N cities, too many for the shared memory. Split the route into segments of m cities; each part of the computation then requires only 2*m float2 elements, e.g. the coordinate ranges (0,m) and (m+1,2m), or (N-2m+1,N-m) and (N-m+1,N).
22 What if the coordinates' size exceeds the size of the shared memory? For a problem of size N and local memory of limited size m, the needed data are two coordinate arrays, e.g. the ranges (m/2+1,m) and (N-m/2+1,N). Total size: 2 * 2 * n * sizeof(coordinate data type). It is possible to calculate distances in arbitrary segments.
23 What if the coordinates' size exceeds the size of the shared memory? Arbitrarily big problems can be solved: two sets of coordinates are needed, and the work can be split and run on multiple devices.

    __device__ int calculatedistance2d(unsigned int i, unsigned int j,
                                       city_coords* coordsa,
                                       city_coords* coordsb)
    {
        register float dx, dy;
        dx = coordsa[i].x - coordsb[j].x;
        dy = coordsa[i].y - coordsb[j].y;
        return (int)(sqrtf(dx * dx + dy * dy) + 0.5f);
    }
24 Results. 2-opt step processing time: approximately 22 billion comparisons/s on the GPU, evaluating the exchange condition distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f). (Log-scale plot, 0.001 s to 10,000 s, for instances from kroE100 through kroA200, pr439, pr2392, fnl4461, d15112, pla85900 up to ara….tsp; the Intel Core i7-3960X is shown for comparison.)
25 Results. Time needed for a single local optimization run on a GTX 680. Compared to an Intel i7-3960X CPU: up to 300x speedup (single core, no SSE) and up to 40x speedup (Intel OpenCL).

Table II. 2-opt - time needed for a single run (CUDA, GPU: GTX 680, CPU: Intel i7-3960X):

    Problem   | GPU kernel | H-to-D copy | D-to-H copy | GPU total | Time to first minimum
    berlin52  | 20 µs      | 50 µs       | 11 µs       | 81 µs     | … µs
    kroE100   | … µs       | 50 µs       | 11 µs       | 82 µs     | … µs
    ch130     | … µs       | 50 µs       | 11 µs       | 82 µs     | … µs
    ch150     | … µs       | 50 µs       | 11 µs       | 84 µs     | … µs
    kroA200   | … µs       | 50 µs       | 11 µs       | 85 µs     | … ms
    ts225     | … µs       | 50 µs       | 11 µs       | 85 µs     | … ms
    pr299     | … µs       | 50 µs       | 11 µs       | 87 µs     | … ms
    pr439     | … µs       | 50 µs       | 11 µs       | 93 µs     | … ms
    rat783    | … µs       | 51 µs       | 11 µs       | 115 µs    | … ms
    vm1084    | … µs       | 51 µs       | 11 µs       | 142 µs    | … ms
    pr2392    | … µs       | 53 µs       | 11 µs       | 363 µs    | … ms
    pcb3038   | … µs       | 55 µs       | 11 µs       | 547 µs    | … ms
    fl3795    | … µs       | 54 µs       | 11 µs       | 788 µs    | … ms
    fnl4461   | … µs       | 58 µs       | 11 µs       | 815 µs    | … ms
    rl…       | … µs       | 59 µs       | 11 µs       | 1079 µs   | … ms
    pla7397   | … µs       | 58 µs       | 11 µs       | 1616 µs   | … ms
    usa13509  | … µs       | 66 µs       | 11 µs       | 4805 µs   | … s
    d15112    | … µs       | 69 µs       | 11 µs       | 6043 µs   | … s
    d18512    | … µs       | 75 µs       | 11 µs       | 9014 µs   | … s
    sw…       | … ms       | 85 µs       | 11 µs       | … ms      | … s
    pla33810  | … ms       | 96 µs       | 11 µs       | … ms      | … s
    pla85900  | … ms       | 205 µs      | 11 µs       | … ms      | … s
    sra…      | … ms       | 249 µs      | 11 µs       | … ms      | … s
    usa…      | … ms       | 287 µs      | 11 µs       | … ms      | … s
    ara…      | … s        | 740 µs      | 11 µs       | … s       | … s
    lra…      | … s        | 1774 µs     | 11 µs       | … s       | … m
    lrb…      | … s        | 2833 µs     | 11 µs       | … s       | … h
26 Quality. (Plot: tour quality vs. time in seconds.)
27 I have implemented and published the LOGO TSP Solver. LOGO - LOcal GPU Optimization; CUDA + OpenCL. Linux/Mac: source code; Windows: binary with GUI. GPL license. More resources about LOGO (slides, papers) are available online.
28 Summary. Big TSP problems can be solved very quickly on a GPU (the fastest GPU TSP solver?). Sometimes naive algorithms run faster overall in parallel. Large and very large problems can typically be decomposed into smaller ones that fit in fast memory. Avoid memory accesses (true for both CPUs and GPUs now): memory is a scarce resource, FLOPs are abundant. Exploit locality by manually reordering data (an additional 0.1 s of preprocessing may save 1 s somewhere else).
29 Thank you!
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Zabin Visram Room CS115 CS126 Searching. Binary Search
Zabin Visram Room CS115 CS126 Searching Binary Search Binary Search Sequential search is not efficient for large lists as it searches half the list, on average Another search algorithm Binary search Very
Physical Data Organization
Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor
GPU Programming in Computer Vision
Computer Vision Group Prof. Daniel Cremers GPU Programming in Computer Vision Preliminary Meeting Thomas Möllenhoff, Robert Maier, Caner Hazirbas What you will learn in the practical course Introduction
OpenACC Programming and Best Practices Guide
OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0
Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
