ultra fast SOM using CUDA




Sijo Mathew, Preetha Joy, Sibi Rajendra, Manoj A V
QuEST Global

Contents
Abstract
Introduction
CUDA Programming Paradigm
Optimization Strategies
Results and Discussion
Conclusion and Future Work
Contact

Abstract

SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. For the efficient construction of large maps, searching for the best-matching unit is usually the computationally heaviest operation in the SOM. The parallel nature of the algorithm and the heavy computation involved make it a good target for a GPU-based parallel implementation. This paper presents an overview of the optimization strategies used for the parallel implementation of the basic SOM on the GPU using the CUDA programming paradigm.

Keywords: Basic-SOM, Data Mining, CUDA.

1 Introduction

The Self-Organizing Map, or SOM, is a data visualization technique that reduces the dimensionality of data through the use of self-organizing neural networks. SOMs reduce dimensionality by producing a map of usually one or two dimensions (two in our case) that plots the similarities of the data by grouping similar data items together. SOMs therefore accomplish two things: they reduce dimensionality and they display similarities.

The first part of a SOM is the data. The self-organizing map projects this n-dimensional data into something that can be better understood visually. In each step, the map node with the minimum Euclidean distance to the input vector is selected as the winner node, and the circular area around the winner node is defined as the proximity area. All the nodes in the proximity area learn from the input vector, moving a little towards the input according to the formula:

m_i(t+1) = m_i(t) + h_ci(t) [x(t) - m_i(t)]

At the beginning of the learning process the radius of the proximity area is fairly large, but it is made to shrink during learning. This ensures that the global order is obtained already at the beginning, whereas towards the end, as the radius gets smaller, the local corrections of the model vectors in the map become more specific.
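As a concrete reference for the update rule above, one training step can be sketched in plain Python. This is a minimal CPU-side sketch, not the paper's CUDA code; the node layout, the parameter names, and the constants k and r used for the neighborhood function detailed below are illustrative assumptions.

```python
import math

def som_step(nodes, x, t, T, alpha0=0.3, r0=5.0, k=1.0):
    """One SOM training step: find the best-matching unit (BMU), then
    pull every node in its neighborhood towards the sample x."""
    # Winner node: minimum (squared) Euclidean distance to the input x.
    bmu = min(range(len(nodes)),
              key=lambda i: sum((w - v) ** 2 for w, v in zip(nodes[i]["w"], x)))
    a, b = nodes[bmu]["pos"]            # grid position of the winner
    alpha = alpha0 * (1 - t / T)        # alpha(t) = alpha_0 * (1 - t/T)
    r = max(r0 * (1 - t / T), 1e-6)     # shrinking proximity radius r(t)
    for node in nodes:
        xi, yi = node["pos"]
        d2 = (xi - a) ** 2 + (yi - b) ** 2          # d^2 on the 2-D grid
        h = alpha * math.exp(-d2 / (k * r) ** 2)    # h_ci(t) = alpha(t) * h(d, t)
        # m_i(t+1) = m_i(t) + h_ci(t) * (x(t) - m_i(t))
        node["w"] = [w + h * (v - w) for w, v in zip(node["w"], x)]
    return bmu
```

On a tiny 2x2 map, the node whose weight vector already equals the sample wins and stays put, while its neighbors drift a little towards the sample, exactly as the shrinking-radius description above requires.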
The proximity area is decreased over time by a neighborhood function, which can be expressed by the following equations:

h(d, t) = exp(-d^2 / k(r(t))^2), and h_ci(t) = α(t) · h(d, t)

where d^2 = (x_i - a)^2 + (y_i - b)^2, and 0 < α(t) < 1, with α(t) = α_0 (1 - t/T)

Here, α_0 is the initial value of α, normally between 0.2 and 0.5, and T is the given total number of learning steps.

Briefly, the SOM steps are:

    Initialize the map
    For t from 0 to (number of iterations):
        Randomly select a sample
        Find the best matching unit
        Scale the neighbors
        Increase t
    End for

2 CUDA Programming Paradigm

NVIDIA's CUDA architecture is a parallel computing architecture that brings graphics processor technology to general-purpose GPU computing. The GPU is organized as a set of multiprocessors, each having 8 CUDA cores and an on-chip memory shared by all 8 CUDA cores within the multiprocessor. Besides this, a global memory space is also available, shared by all the multiprocessors. The overall architecture is shown in Figure 1. This architecture allows the GPU to be viewed as a data-parallel

computing device that operates as a coprocessor to the main CPU (the host). At the hardware level, the GPU is a collection of multiprocessors, each with several processing elements. Every processor in a multiprocessor executes the same instruction in every cycle, each operating on its own data, which makes the multiprocessor a SIMD processor. Communication between multiprocessors, and between the GPU and the CPU, is possible only through the global memory space, which can be accessed and modified by all the cores of the multiprocessors and also from the CPU thread. The processing elements of a multiprocessor can synchronize with each other, but there is no direct synchronization mechanism between the multiprocessors.

Figure 1: GPU - High level architecture

For the programmer, CUDA provides a C extension, which makes it easy to exploit the parallel processing power of GPUs for compute-intensive general-purpose applications. This hides the graphical behavior of GPUs from developers and gives them the comfort of using the GPU as a many-core platform that is computationally much faster. The CUDA API hides all the complexities involved in thread creation and generates light-weight threads for execution on the CUDA cores, all at the ease of a function call. However, the total number of CUDA threads and their occupancy on the multiprocessors depends entirely on the parameters specified at the API invocation. This is where significant programmer effort is needed to utilize the GPU at its peak and obtain the best performance. Efficient memory access patterns and efficient utilization of the available GPU memories, such as texture, shared and constant memory, are also required. This may require the serial logic to be polished or rewritten to best suit the GPU's parallel architecture.

3 Optimization Strategies

Most real-world problems using the SOM logic require the efficient construction of large maps. This involves processing large data buffers, and heavy computation is involved in the matching process. Once the map is initialized with random values, the remaining compute-intensive operations are mostly data parallel, which makes the algorithm a good target for a GPU-based parallel implementation. Our implementation splits the entire SOM logic into three major CUDA kernels: (a) distance finding, (b) a reduction operation for minimum finding, and (c) an adjust-weights kernel.

The CUDA kernels are optimized to achieve their best performance through efficient utilization of GPU memory, i.e. by using constant, shared and texture memory and by avoiding non-coalesced global memory accesses. Constant parameters that remain the same for a given input and are used within the CUDA kernels for various calculations are placed in the GPU's constant memory (the fastest of the available GPU memories, but very limited in size). The data buffer given as input to the SOM logic is used for various calculations inside the above CUDA kernels, so this input buffer is copied to the device's global memory prior to the kernel invocations. To avoid too many accesses to global memory (which is much slower), we tried binding the input buffer to a texture (thereby

making it cached), and the cached texture is used for all the computations within the kernels. All other buffers allocated in the GPU's global memory and used within the CUDA kernels are either cached using texture binding or copied to shared memory (the on-chip memory) to achieve the best performance. The kernel invocations are also checked for better utilization of the multiprocessor cores by making the block size (the number of threads within a block, which is treated as a single unit to be active on a multiprocessor) optimal based on register usage, shared memory usage and other parameters.

4 Results and Discussion

Table 1 gives the performance comparison figures for various NVIDIA GPUs as the map size and attribute count are varied. The performance is measured in terms of speed-up with respect to the CPU implementation (Intel Core 2 Quad Q8400 @ 2.66 GHz, using only a single core) and is taken for the GTX260 and the Tesla C1060. Figure 2 and Figure 3 plot the time measured (in seconds) on these platforms.

Figure 2: Comparison chart with attribute count 32 (time taken in seconds)
Figure 3: Comparison chart with attribute count 64 (time taken in seconds)
Table 1: Performance comparison table (in terms of speed-up)
* The GTX260 does not have enough GPU memory to handle this data.
** With iterations and data items fixed to 20 and 16 respectively.

In the current GPU implementation, increasing the map size increases the number of parallel paths on the GPU, while increasing the attribute count increases the computation within each CUDA thread (since GPUs are good at computation, this does not add much to the CUDA kernel execution time). Analyzing the results in Table 1, it is clear that as the map size or attribute count grows, the CUDA implementation becomes much faster than the corresponding CPU implementation.
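The minimum-finding step, kernel (b) above, is the classic parallel-reduction pattern: each round halves the number of active elements, so the best-matching unit among n distances emerges in O(log n) steps rather than n. A plain-Python sketch of that tree reduction follows; it is illustrative only, as the actual CUDA kernel performs these comparisons on shared-memory tiles within each thread block.

```python
def min_reduce(dist):
    """Tree reduction returning (index, value) of the minimum distance,
    mirroring how a CUDA block halves its active threads each step."""
    vals = list(enumerate(dist))      # keep indices so the argmin survives
    n = len(vals)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n // 2):
            # "Thread" i compares element i with element i + half and
            # keeps the smaller, just as a block does in shared memory.
            if vals[i + half][1] < vals[i][1]:
                vals[i] = vals[i + half]
        n = half
    return vals[0]                    # (bmu_index, min_distance)
```

For example, min_reduce([3.0, 1.0, 4.0, 0.5, 2.0]) returns (3, 0.5): node 3 is the best-matching unit.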

5 Conclusion and Future Work

We have shown that the SOM implementation can benefit greatly from using a CUDA-capable GPU. Overall, the best CUDA implementation provides a speed-up of 84x (compared to the CPU implementation) for a 2000x2000 map size with 64 attributes on a Tesla C1060 GPU.

When the map size or the attribute count is too large, there is not enough space on the GPU to hold the entire data at once. Of the test platforms discussed above, the GTX260 has only 896 MB of memory, while the Tesla C1060 has 4 GB. The maximum map size and attribute count are therefore much lower for the GTX260 than for the Tesla, which leaves some map sizes unsupported on the GTX260, as shown in the results table. To avoid this, we have to split the data according to the available device memory and perform the computations separately for the various blocks, which requires rewriting the logic accordingly; this is planned as our next milestone.

Straightforward CUDA implementations can already achieve substantial benefits. For further performance tuning, significant programmer effort can be required to fully utilize the GPU's potential, particularly when irregular memory access patterns or small kernels are present. Despite this extra effort, the benefits can be dramatic. Our experience with CUDA shows the power of the GPU as a parallel platform, and demonstrates that many-core platforms have the potential to improve the performance of a wide range of compute-intensive applications.

6 Contact

QuEST-NVIDIA Center for GPU Computing.

www.quest-global.com