High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem




High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
Kamil Rocki, PhD
Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo

Why TSP?
- It's hard (NP-hard, to be exact), and especially hard to parallelize
- But it's easy to understand - a basic problem in CS
- TSPLIB, a set of standard problems, allows comparisons of algorithms
- It's simple to adjust the problem size
- A platform for the study of general methods that can be applied to a wide range of discrete optimization problems

Not only the salesman's headache:
- Computer wiring
- Vehicle routing
- Crystallography
- Robot control
pla85900.tsp: it took 15 years to solve this problem (solved in 2006)

Local/Global Optimization
- Usually the search space is huge, with many valleys and local minima
- Local search might terminate in one of the local minima
- Global search should find the best solution

Iterated Global Search - Global Optimization (approximate algorithm)
- Local search is followed by some sort of tour randomization (perturbation), and the process is repeated while there is time left
- Most of the time is spent in the local search part
[Figure: cost landscape over the solution space S with local optima s and s*, and achieved tour quality (roughly 96-100% of the optimum) for kroe100, ch130, ch150, kroa200, ts225, pr299, pr439, rat783, vm1084 and pr2392. Source: Thomas Stützle, Iterated Local Search & Variable Neighborhood Search, MN Summer School, Tenerife, 2003, p. 9]

Iterated Global Search - Global Optimization (approximate algorithm)

 1: procedure IteratedLocalSearch
 2:   s0 := GenerateInitialSolution()
 3:   s* := 2optLocalSearch(s0)              // GPU
 4:   while (termination condition not met)
 5:     s' := Perturbation(s*)
 6:     s*' := 2optLocalSearch(s')           // GPU
 7:     s* := AcceptanceCriterion(s*, s*')
 8:   end while
 9: end procedure

TSP - the big valley theory (why does ILS work?)
- Local minima form clusters; the better the solution, the closer it is to the global optimum
- The global minimum is somewhere in the middle of the cluster
- ILS worsens the tour only slightly, so local search is repeated close to the previous minimum
[Figure: tour cost (x10^3, roughly 29.6-32.6) plotted against (a) mean distance to other solutions and (b) distance to the optimal solution]
Source: J. Boyan, W. Buntine, A. Jagota (eds.): Statistical Machine Learning for Large-Scale Optimization, 2000

2-opt local optimization
Exchange a pair of edges to obtain a better solution if the following condition is true:
distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f)
[Figure: tours over cities A-G before and after the edge exchange]

2-opt local optimization
In order to find such a pair, one needs to perform (n-2)(n-3)/2 checks:

for (int i = 1; i < points - 2; i++)
  for (int j = i + 1; j < points - 1; j++)
    if (distance(i, i-1) + distance(j+1, j) >
        distance(i, j+1) + distance(i-1, j))
      update best i and j;

remove edges (best_i, best_i - 1) and (best_j + 1, best_j)
add edges (best_i, best_j + 1) and (best_i - 1, best_j)

There are some pruning techniques in more complex algorithms; here all possible pairs are analyzed.
A 10,000-city problem -> 49.9 million pairs; a 100,000-city problem -> 5 billion pairs.
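The pair counts quoted above follow directly from the loop bounds: i runs over n-3 values and j over the remainder of each row, giving (n-2)(n-3)/2 checks per pass.

```c
/* Number of pair checks in one full 2-opt scan with the loop
 * bounds shown above: (n-2)(n-3)/2. */
long long two_opt_checks(long long n) {
    return (n - 2) * (n - 3) / 2;
}
```

For n = 10,000 this gives 49,975,003 (about 49.9 million); for n = 100,000 it gives 4,999,750,003 (about 5 billion), matching the slide.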

Obtaining a distance
- Naive approach: calculate it each time it's needed
- Smart way: reuse the results (a look-up table)
- For a 100-city problem the LUT is approximately 5 times faster (CPU)
- For a 100,000-city problem it can be slower (cache)
- With multiple cores it becomes slower still (cache/memory)

Problem (TSPLIB) | Cities (n) | LUT memory, n^2 (MB) | Coordinate memory, n (kB)
kroe100   | 100  | 0.038 | 0.78
ch130     | 130  | 0.065 | 1.02
ch150     | 150  | 0.086 | 1.17
kroa200   | 200  | 0.15  | 1.56
ts225     | 225  | 0.19  | 1.75
pr299     | 299  | 0.34  | 2.34
pr439     | 439  | 0.74  | 3.43
rat783    | 783  | 2.34  | 6.12
vm1084    | 1084 | 4.48  | 8.47
pr2392    | 2392 | 21.8  | 18.69
pcb3038   | 3038 | 35.21 | 23.73
fl3795    | 3795 | 54.9  | 29.65
fnl4461   | 4461 | 75.9  | 34.85

LUT (look-up table) vs. recalculation
[Figure: runtime vs. number of cores; the "not-so-clever" recalculating algorithm overtakes the "clever" LUT algorithm as the core count grows]

A change in algorithm design - sequential
- Memory is free: reuse computed results
- Use sophisticated algorithms, prune the search space

A change in algorithm design - parallel
- Memory is limited (size, throughput); computing is free (almost)
- Sometimes brute-force search might be faster (especially on GPUs)
- Avoiding divergence: it takes the same amount of time to process 10, 100, even 10,000 elements in parallel
- But that means empty flops
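The "empty flops" remark can be made concrete: GPU work is issued in warp-sized chunks (32 threads on the hardware discussed), so the effective launch size is the next multiple of 32 and the surplus lanes compute results that are discarded. A minimal sketch:

```c
/* Effective number of threads launched for n work items when the
 * hardware executes in warps of 32: surplus lanes do "empty flops". */
unsigned round_up_to_warp(unsigned n) {
    return (n + 31u) & ~31u;  /* next multiple of 32 */
}
```

So 10 items cost as much as 32, and 100 as much as 128: uniform, branch-free brute force wastes some lanes but keeps every warp busy, which is the design point of the slide.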

GPU implementation
The GPU uses this simple function for EUC_2D problems:

__device__ int calculatedistance2d(unsigned int i, unsigned int j,
                                   city_coords* coords)
{
    // registers + shared memory
    register float dx, dy;
    dx = coords[i].x - coords[j].x;
    dy = coords[i].y - coords[j].y;
    return (int)(sqrtf(dx * dx + dy * dy) + 0.5f); // 6 FLOPs + sqrt
}

GPU implementation
dist(i,j) = dist(j,i) and dist(i,i) = 0, so only pairs on one side of the diagonal need checking.
The matrix of city pairs to be checked for a possible swap is enumerated so that each pair corresponds to one GPU job: job 0 -> (0,1), job 1 -> (0,2), job 2 -> (1,2), job 3 -> (0,3), and so on. With 32 threads, the jobs are processed in successive iterations (iteration 1, 2, 3, ...), each covering the next 32 jobs.
Each thread has to check:

if (distance(i, i-1) + distance(j+1, j) >
    distance(i, j+1) + distance(i-1, j))
  update best i and j;

Optimizations
Preorder the coordinates in the route's order on the CPU:
- Reads from global memory are no longer scattered - much faster
- 8 B needed per city instead of 10 B
- No unnecessary address calculations
- The problem can be split if it's too big for the shared memory's size

Before: a route (e.g. 0 5 3 2 6 12 4 15 13 11 14 9 1 8 10 7) plus an array of coordinates; the coordinates of the point at position n are
    x = coordinates[route[n]].x
    y = coordinates[route[n]].y
Total size: n * sizeof(route data type) + 2 * n * sizeof(coordinate data type)

After: route + coordinates merged into an array of ordered coordinates; the coordinates of the point at position n are
    x = ordered_coordinates[n].x
    y = ordered_coordinates[n].y
Total size: 2 * n * sizeof(coordinate data type)
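The reordering itself is a single CPU-side gather, sketched below (names are ours): after it, position n is read directly instead of through route[n].

```c
typedef struct { float x, y; } city_coords;

/* Reorder coordinates into route order once on the CPU, so GPU threads
 * read ordered[n] contiguously instead of the scattered gather
 * coordinates[route[n]]. */
void order_by_route(const city_coords *coords, const int *route,
                    city_coords *ordered, int n) {
    for (int i = 0; i < n; i++)
        ordered[i] = coords[route[i]];
}
```

The one-off gather cost on the host is exactly the "additional 0.1 s of preprocessing" trade-off mentioned in the summary.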

GPU implementation - an overview of the algorithm:
1. Copy the ordered coordinates to the GPU global memory (CPU)
2. Execute the kernel
3. Copy the ordered coordinates to the shared memory
4. Calculate the swap effect of one pair
5. Find the best value and store it in the global memory
6. Read the result (CPU)

What if the coordinates' size exceeds the size of shared memory?
Assume a big problem: N cities, too many for the shared memory. Each part of the pair matrix requires only 2*m float2 elements to be calculated, e.g. the coordinate ranges (0, m) and (m+1, 2m), ..., up to (N-2m+1, N-m) and (N-m+1, N).

What if the coordinates' size exceeds the size of shared memory?
For a problem of size N and local memory of limited size m, the needed data are two coordinate arrays, e.g. the ranges (m/2+1, m) and (N-m/2+1, N).
Total size: 2 * 2 * n * sizeof(coordinate data type)
It is possible to calculate distances in arbitrary segments.
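The correctness requirement behind this tiling is that iterating over all segment pairs still visits every city pair exactly once. A hypothetical host-side sketch (ours, not the paper's code) of that enumeration:

```c
/* Tile the n cities into segments of m; for each segment pair (a <= b),
 * visit the city pairs (i, j), i < j, whose endpoints fall in those
 * segments. mark[i*n+j] counts how often pair (i, j) is examined - the
 * test below checks each pair is covered exactly once. */
void cover_pairs(int n, int m, unsigned char *mark) {
    int tiles = (n + m - 1) / m;
    for (int a = 0; a < tiles; a++)
        for (int b = a; b < tiles; b++)
            for (int i = a * m; i < (a + 1) * m && i < n; i++) {
                int j0 = (b == a) ? i + 1 : b * m;  /* diagonal tiles:  */
                for (int j = j0; j < (b + 1) * m && j < n; j++)
                    mark[i * n + j]++;              /* upper triangle   */
            }
}
```

On the GPU, each (a, b) pair becomes one kernel launch (or block) that loads only those two m-city segments into shared memory, which is what lets arbitrarily big problems - and multiple devices - be handled.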

What if the coordinates' size exceeds the size of shared memory?
- Arbitrarily big problems can be solved
- 2 sets of coordinates are needed
- Work can be split and run on multiple devices

__device__ int calculatedistance2d(unsigned int i, unsigned int j,
                                   city_coords* coordsa,
                                   city_coords* coordsb)
{
    register float dx, dy;
    dx = coordsa[i].x - coordsb[j].x;
    dy = coordsa[i].y - coordsb[j].y;
    return (int)(sqrtf(dx * dx + dy * dy) + 0.5f);
}

Results
2-opt step processing time: approximately 22 billion comparisons/s.
For a 200,000-city problem, 2 * 10^10 pairs need to be checked.
[Figure: log-scale 2-opt step processing times (from ~10 µs up to ~10,000 s) for kroe100, kroa200, pr439, pr2392, fnl4461, d15112, pla85900 and ara238025.tsp, GPU vs. an Intel i7-3960X, with the exchange condition distance(b,f) + distance(g,d) > distance(b,d) + distance(g,f) illustrated]

Results
Time needed for a single local optimization run. GPU: GTX 680, compared to an Intel i7-3960X CPU:
- Up to 300x speedup (single core, no SSE)
- Up to 40x speedup (Intel OpenCL)

Table II: 2-opt - time needed for a single run (CUDA, GPU: GTX 680, CPU: Intel i7-3960X)

Problem (TSPLIB) | GPU kernel time | Host-to-device copy | Device-to-host copy | GPU total time | 2-opt checks/s (millions) | Full 2-opt time | Initial length | Optimized length
berlin52   | 20 µs     | 50 µs   | 11 µs | 81 µs     | 76.56    | 687 µs    | 9951      | 8930
kroe100    | 21 µs     | 50 µs   | 11 µs | 82 µs     | 282.92   | 897 µs    | 24846     | 23025
ch130      | 21 µs     | 50 µs   | 11 µs | 82 µs     | 483.81   | 1183 µs   | 7849      | 7041
ch150      | 23 µs     | 50 µs   | 11 µs | 84 µs     | 591.2    | 1342 µs   | 7742      | 7120
kroa200    | 24 µs     | 50 µs   | 11 µs | 85 µs     | 1015.78  | 2.547 ms  | 34601     | 31685
ts225      | 24 µs     | 50 µs   | 11 µs | 85 µs     | 1145.97  | 1.090 ms  | 135193    | 128513
pr299      | 26 µs     | 50 µs   | 11 µs | 87 µs     | 1878.46  | 2.555 ms  | 61493     | 54895
pr439      | 32 µs     | 50 µs   | 11 µs | 93 µs     | 3307.85  | 3.067 ms  | 124127    | 115490
rat783     | 53 µs     | 51 µs   | 11 µs | 115 µs    | 6385.53  | 8.827 ms  | 10734     | 9658
vm1084     | 80 µs     | 51 µs   | 11 µs | 142 µs    | 8122.51  | 11.862 ms | 287710    | 267210
pr2392     | 299 µs    | 53 µs   | 11 µs | 363 µs    | 11065.33 | 89.577 ms | 454068    | 412085
pcb3038    | 471 µs    | 55 µs   | 11 µs | 547 µs    | 10689.4  | 168.02 ms | 165688    | 147690
fl3795     | 723 µs    | 54 µs   | 11 µs | 788 µs    | 10529.32 | 256.19 ms | 34843     | 31312
fnl4461    | 746 µs    | 58 µs   | 11 µs | 815 µs    | 14098.03 | 327.1 ms  | 215085    | 194746
rl5934     | 1009 µs   | 59 µs   | 11 µs | 1079 µs   | 18172.88 | 291.09 ms | 627417    | 582958
pla7397    | 1547 µs   | 58 µs   | 11 µs | 1616 µs   | 18153.6  | 870.2 ms  | 26954302  | 24734292
usa13509   | 4728 µs   | 66 µs   | 11 µs | 4805 µs   | 19481.58 | 6.5251 s  | 23271693  | 20984503
d15112     | 5963 µs   | 69 µs   | 11 µs | 6043 µs   | 19272.07 | 8.6757 s  | 1810355   | 1652806
d18512     | 8928 µs   | 75 µs   | 11 µs | 9014 µs   | 19268.93 | 14.975 s  | 745983    | 675638
sw24978    | 14.99 ms  | 85 µs   | 11 µs | 15.08 ms  | 20863.46 | 37.284 s  | 1002187   | 908598
pla33810   | 26.81 ms  | 96 µs   | 11 µs | 26.92 ms  | 21340.36 | 68.67 s   | 76680735  | 69763154
pla85900   | 168.4 ms  | 205 µs  | 11 µs | 168.6 ms  | 21915.11 | 1109.2 s  | 163831861 | 149708033
sra104815  | 249.5 ms  | 249 µs  | 11 µs | 249.8 ms  | 22017.38 | 2093.4 s  | 291403    | 264889
usa115475  | 302.3 ms  | 287 µs  | 11 µs | 302.6 ms  | 22054.81 | 3337.8 s  | 7143419   | 6492848
ara238025  | 1.2934 s  | 740 µs  | 11 µs | 1.2941 s  | 21902.18 | 26497 s   | 674838    | 610795
lra498378  | 5.6369 s  | 1774 µs | 11 µs | 5.6388 s  | 22031.65 | 3078 m    | 2467227   | 2264810
lrb744710  | 12.634 s  | 2833 µs | 11 µs | 12.638 s  | 21947.17 | 203.8 h   | 1867266   | 1692308

Quality
[Figure: solution quality (% of the optimal tour length, improving from ~104% down to ~100.3%) versus time (0-3500 s)]

I have implemented and published the LOGO TSP Solver (LOGO = LOcal GPU Optimization)
- CUDA + OpenCL
- http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/projects.html
- Linux/Mac: source code; Windows: binary with GUI
- GPL license
- More resources about LOGO there (slides, papers)

Summary
- Big TSP problems can be solved very quickly on a GPU (the fastest GPU TSP solver?)
- Sometimes naive algorithms run faster overall in parallel
- Large and very large problems can typically be decomposed into smaller ones that fit in fast memory
- Avoid memory accesses (true for both CPUs and GPUs now): memory is a scarce resource, FLOPs are abundant
- Exploit locality by manually reordering data (an additional 0.1 s of preprocessing may save 1 s somewhere else)

Thank you!