Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
|
|
|
- Reynold Hoover
- 10 years ago
- Views:
Transcription
1 Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin. Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration. 27th Image and Vision Computing New Zealand (IVCNZ), Nov 2012, Dunedin, New Zealand. pp.nc, <hal > HAL Id: hal Submitted on 11 Dec 2012 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
2 Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin ZHANG Université Européenne de Bretagne, Jean-Francois NEZAN Université Européenne de Bretagne, xxx Université Européenne de Bretagne, Jean-Gabriel COUSIN Université Européenne de Bretagne, ABSTRACT Heterogeneous computing system increases the performance of parallel computing in many domain of general purpose computing with CPU, GPU and other accelerators. With Hardware developments, the software developments like Compute Unified Device Architecture(CUDA) and Open Computing Language (OpenCL) try to offer a simple and visualized tool for parallel computing. But it turn out to be more difficult than programming on CPU platform for optimization of performance. For one kind of parallel computing application, there are different configuration and parameters for various hardware platforms. In this paper, we apply the Hybrid Multi-cores Parallel Programming(HMPP)[1][2] to automatic-generates tunable code for GPU platform and show the result of implementation of Stereo Matching with detailed comparison with C code version and manual CUD- A version. The experimental results show that the default and optimized HMPP have the approximative 1 compared with CUDA implementation. And the HMPP workbench can greatly reduce the time of application development using parallel computing device. Keywords HMPP, CUDA, OpenCL, Stereo Matching 1. INTRODUCTION Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX...$ Stereo vision is a well suited technology which offer a precise description of 3D information[3]. Stereo Matching uses only two cameras and a processing unit to do the disparity matching and 3D reconstruction. In the same time, the exploitation of parallel programming tool on heterogenous system and architectures such as GPU, APU, FPGA and so on. For the GPU, there are great number of community employing the CUDA which proposed by Nvidia in Comparing with CUDA, Open Computing Language (OpenCL) provide better portable feature on multi hardware platform, make the heterogenous parallel computing possible not only on GPU platform[4]. Recently, Altera also propose their OpenCL solutions for FPGA in order to get the low power and quick to market[5]. In order to achieve the best performance in different platforms, we must follow some strict rules to transform the code form single core systems to heterogeneous or multi-core systems. All the procedures of transformation is complicated and much more redundancy. In our work, we use HMPP Workbench 3.0, a directive-based compiler targeted to GPUs and CPUs. Based on HMPP, programmers transform sequential code using paired directives to automatically generate CUDA or OpenCL kernels for various applications. codelet: routine implementation callsite: routine invocation Basicly with HMPP, using only these two paired directives can replace the procedure of manual writing the complex CUDA or OpenCL kernel code. Scott proposed a method of auto-tuning and optimize the code with several applicable transformation configurations and then pick the optimized version of the code with the best performance[2]. Jianbin propose an auto-tuning solution to data streams clustering with OpenCL[6]. In our work, we convert the sequential C code to parallel versions of Stereo Matching kernel with optimized HMPP directives in order to obtain better per-
3 formance compared with those default HMPP and manual CUDA kernels. These contributions of our work are: showing that automatic optimizing HMPP complier HMPP can be used to effectively parallelize and generate the GPU kernels. analyzing difference of performance of the HMPP CU- DA (CUDA kernel generated by HMPP) and manual CUDA kernel Section 2 elaborate the optimization strategies of convert C code to CUDA / OpenCL kernel code. Section 3 presents the Parallel Computing with HMPP. Section 4 presents the stereo matching algorithm and the implementation of HMP- P. Section 5 illustrate the experiments results between the manual CUDA and HMPP CUDA. A brief conclusion is given in section TRANSFORMATION RULES Before we use HMPP to transform C code to CUDA / OpenCL kernel code, we must acquaint the rules about programming with the parallel computing device. we present some optimization strategies for transformation used in our work. 2.1 Data Transferring In the image or video processing, there are several hundreds even thousand MB data to process. So we should avoid the unnecessary data transferring between host and device. The best choice is transfer as much as possible date at one time to fully utilize the bandwidth of GPU. 2.2 Fully Saturate The Computing Resource Using parallel computing language, programmers should think about the practical infrastructure of different platforms. Because no one cannot ensure that the same kernel has the best performance on different platforms. Supposing that NVIDIA s GPU support 512 active threads per Compute Unit and ATI s GPU support 1024 active threads per Compute Unit, our experiments use 256 threads as one group which means only 33.3% and 25% capability of Compute Unit be used in NVIDIA s and ATI s GPU separately. Therefore Programmers should organize as much as possible threads to fully employ the compute resource of device. In the Section 5, we illustrate the occupancy of our experimental device in different conditions. 2.3 Use shared memory Because of on chip, the shared memory is much faster than the device memory. In fact, accessing the shared memory is as faster as accessing a register, and the shared memory latency is roughly 100x lower than the device memory latency tested in our experiments. To get maximum performance, the most important point is to use shared memory but avoid banks conflict. In order to run faster and avoid refetching from device memory, programmers should put the data into the local memory ahead of complex and repeated computing. 3. PARALLEL COMPUTING WITH HMPP Figure 1: HMPP Workbench framework. Recently more and more powerful multi-core intergraded architecture like Nvidia tegra3(quad-core CPU, 12 core G- PU supporting 3D stereo)[7], Snapdragon S4 APQ8064(Quad- Core CPU, Adreno 320 GPU supporting OpenCL)[8] support the parallel programming model like OpenCL. In such All-in-one chips, GPU is not only play the part of rending and display, it also can be responsible for the general scientific computing based on their powerful computing capacity and the greater bandwidth. The HMPP directive-based programming model offers a powerful syntax to efficiently offload computations on hardware accelerators and keep the original C or Fortran codes. Figure 2 show the HMPP Workbench framework. The basic paired directives syntax of HMPP are as follow: //HMPP c o d e l e t #pragma hmpp l a b e l 1 c o d e l e t, t a r g e t=cuda void t e s t ( int n, int A[ n ], int B[ n ] ) { for ( int i = 0 ; i < n ; i ++){ B[ i ] = A[ i ] A[ i ] ; int main ( int argc, char argv ) { //HMPP c a l l s i t e #pragma hmpp t e s t c a l l s i t e t e s t (99, A, B ) ; Codelet is used before the definition of C functions which could be transformed in to parallel kernels. Callsite is the HMPP directive for invocation on such an accelerator device. 4. STEREO MATCHING ALGORITHMS Stereo vision aim to reconstruct a disparity map (depth information) from two views. For real-time stereo systems both speed and accuracy are crucial criterions. In order to satisfy the demand of real time, it must be with the benefit of
4 hardware acceleration, especially data parallel architectures- GPU. After analyzing the sequential C code of stereo vision, the most stereo matching algorithms spent a large percent of workload data parallel computing around every pixels. So the stereo matching is suitable for CUDA or HMPP accelerating. Here we adopt the ESAW (exponential step size adaptive weight)[9] and ESMP (exponential step size message propagation)[10]stereo algorithm, because the author offer the source code of CUDA to download and we can compare the performance of the manual CUDA and CUDA generated by HMPP directly. 4.1 Default HMPP Transformation For the version of HMPP CUDA, we only insert these two lines directives into the source C code without any modification like that: #pragma hmpp s t e r e o c o d e l e t, t a r g e t=cuda, extern void s t e r e o ( ) { int main ( int argc, char argv ) { #pragma hmpp s t e r e o c a l l s i t e s t e r e o ( ) ; Only two lines HMPP directives codelet and callsite can replace writing complicated manual CUDA code. And the difference of performance between the manual CUDA and HMPP CUDA will be described in the section Optimized HMPP Transformation As we illustrated in the Section2, we should use the shared memory and even L1 cache to optimize the HMPP code. In our Optimized version of HMPP code: 1. we use the hmppcg directives to determine which loop should be parallelized. // Exemple : HMPP d i r e c t i v e s : u n r o l l f o r l o o p #pragma hmppcg g r i d i f y ( j, i ) for ( j = 0 ; j < HEIGHT; j ++){ for ( i = 0 ; i < WIDTH; i ++){ 2. we use the hmppcg grid to allcate the buffer to shared memory. #pragma hmppcg g r i d shared buffer name 3. Through the configuration of L1 cache preference, there is obvious improvement of performance as described in the section 5. cudadevicesetcacheconfig ( args ) ; a r g s = cudafunccachepreferl1 ; 5. EXPERIMENTAL RESULTS Table 1: Consumed time (ms) C code Manual CUDA HMPP CUDA ESAW ESMP Figure 2: Performance of ESAW and ESMP. 5.1 Experiment Environment Our experiment environment is configured as follow: CPU: Inter Xeon W3540 (8 cores, 2.93GHz) GPU: Nvidia Quadro FX 3700 (128 cuda cores, 1 GB memory) OS: ubuntu LTS CUDA: v4.2.9 HMPP-Workbench: v Consumed Time Measure We test the ESAW and ESMP algorithms in such three implementations: C code Manual written CUDA code HMPP CUDA code And we set the the iteration times = 9 in both ESAW and ESMP algorithms, keep the others parameters default. From the experimental results as described in the Table 1 and Figure 2, we can see that both manual CUDA and HMPP CUDA have the obvious speed-up compared with C code implementation of ESAW and ESMP algorithms. The manual CUDA obtains 18.5x to 28.3x speed-up, corresponding with HMPP CUDA s 12.1x to 19.5x speed-up. After analyzing the difference performance between the manual CUDA and HMPP CUDA, there are two main factors: Different transferring between GPU and host Different occupancy of GPU By the Nvidia CUDA profiler, both the time consumed of manual CUDA implementation and HMPP CUDA implementation are classified into three groups: 1. H2D (transferring data from host to device like GPU), 2. kernel (functions executed on GPU to realize the algorithms), 3. D2H (transferring data from device to host). As described in the Figure 3, we can see that the manual CUDA cost two much time on D2H(almost 50% time consumed total) which even close to the kernel consumed time.
5 Figure 3: Performance Details (H2D (transferring data from host to device like GPU), kernel (functions executed on GPU to realize the algorithms), D2H (transferring data from device to host)) Figure 4: Occupancy details of manual and HMPP CUDA for ESAW. 5.3 Occupancy As discussed in the Section 2, occupancy is the key factor to determine whether these kernels can fully saturate the computing device. According to the Nvidia profiler, the manual CUDA didn t make use of the GPU computing unit very well as shown in the Figure 4 and 5. Most of time, the manual CUDA only take 33% or 66% occupancy of our GPU device. And the HMPP CUDA take 66% occupancy in the most time which is better than the manual CUDA. And the same kernel maybe has the different occupancy on different device. This is the problems of determining the best code transformation at a high level compiler like HMPP. 5.4 Optimized HMPP Transformation Because the configuration of L1 cache preference need the compute capability 2.x, but our GPU-Quadro FX 3700 only has compute capability 1.1. So we change to another GPU platform GeForce GTX 550 Ti for the esperiment of Optimized HMPP. For the ESAW algorithm, we get the result of 60ms for manual CUDA and 80ms for HMPP CUDA. It mean that the HMPP CUDA can obtain the close performance as manual CUDA even have the chance to exceed it. The millions of disparity per second(mds) is another criterion of real-time stereo vision system. The optimized HMPP CUDA can achieve 79.5 MDS which meets the demand of real-time stereo vision. 5.5 Comparison of Disparity Map We use the two series image: Cones and Teddy for the test as shown in the Figure 6, and executed using GPU, we get the final visual disparity map of the manual CUDA and HMPP CUDA separately as shown in the Figure 7. There are little difference between the two version Disparity Map and the error rate of Disparity Map are as follow:(waiting the result) Figure 5: Occupancy details of manual and HMPP CUDA for ESMP. 6. CONCLUSION In this work, we use the directive-based HMPP Workbench to generate the CUDA kernel code and optimized the HMPP directive for bettter performance. With HMP- P, we were able to quickly generate different versions of portable code on different device NVIDIA GPU (CUDA target), AMD GPU (OpenCL), Intel MIC(coming soon). The experimental results show that the default and optimized HMPP have the approximative performance compared with CUDA implementation. And the HMPP workbench can greatly reduce the time of application development using parallel computing device.
6 Figure 6: Source image of Cones and Teddy. [8] Qualcomm. snapdragon. [9] Wei Yu, Tsuhan Chen, and J.C. Hoe. Real time stereo vision using exponential step cost aggregation on gpu. In Image Processing (ICIP), th IEEE International Conference on, pages , nov [10] Wei Yu, Tsuhan Chen, F. Franchetti, and J.C. Hoe. High performance stereo vision designed for massively data parallel platforms. Circuits and Systems for Video Technology, IEEE Transactions on, 20(11): , nov Figure 7: ESAW. 7. REFERENCES [1] CAPS. Hmpp. [2] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to gpu codes [3] Jian Sun, Nan-Ning Zheng, and Heung-Yeung Shum. Stereo matching using belief propagation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(7): , july [4] Jinglin ZHANG, Jean-Francois NEZAN, and Jean-Gabriel COUSIN. Parallel implementation of motion estimation based on heterogeneous parallel computing with opencl. In High Performance Computing And Communications (HPCC), 2012 IEEE 14th International Conference on, [5] Altera. Opencl. [6] Jianbin Fang, A.L. Varbanescu, and H. Sips. An auto-tuning solution to data streams clustering in opencl. In Computational Science and Engineering (CSE), 2011 IEEE 14th International Conference on, pages , aug [7] Nvidia. Tegra3.
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Mobility management and vertical handover decision making in heterogeneous wireless networks
Mobility management and vertical handover decision making in heterogeneous wireless networks Mariem Zekri To cite this version: Mariem Zekri. Mobility management and vertical handover decision making in
ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System
ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System Céline Franco, Anthony Fleury, Pierre-Yves Guméry, Bruno Diot, Jacques Demongeot, Nicolas Vuillerme To cite this version: Céline Franco,
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures
A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures Serge Chaumette, Jean-Michel Lepine, Franck Rubi To cite this version: Serge Chaumette,
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
Territorial Intelligence and Innovation for the Socio-Ecological Transition
Territorial Intelligence and Innovation for the Socio-Ecological Transition Jean-Jacques Girardot, Evelyne Brunau To cite this version: Jean-Jacques Girardot, Evelyne Brunau. Territorial Intelligence and
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Study on Cloud Service Mode of Agricultural Information Institutions
Study on Cloud Service Mode of Agricultural Information Institutions Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang To cite this version: Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang. Study on Cloud
Intelligent Heuristic Construction with Active Learning
Intelligent Heuristic Construction with Active Learning William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather E H U N I V E R S I T Y T O H F G R E D I N B U Space is BIG! Hubble Ultra-Deep Field
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
Cobi: Communitysourcing Large-Scale Conference Scheduling
Cobi: Communitysourcing Large-Scale Conference Scheduling Haoqi Zhang, Paul André, Lydia Chilton, Juho Kim, Steven Dow, Robert Miller, Wendy E. Mackay, Michel Beaudouin-Lafon To cite this version: Haoqi
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Flauncher and DVMS Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically
Flauncher and Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically Daniel Balouek, Adrien Lèbre, Flavien Quesnel To cite this version: Daniel Balouek,
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Managing Adaptability in Heterogeneous Architectures through Performance Monitoring and Prediction
Managing Adaptability in Heterogeneous Architectures through Performance Monitoring and Prediction Cristina Silvano [email protected] Politecnico di Milano HiPEAC CSW Athens 2014 Motivations System
Xeon+FPGA Platform for the Data Center
Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
OpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. [email protected] Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Embedded Systems: map to FPGA, GPU, CPU?
Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven [email protected] Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware
Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015
INF5063: Programming heterogeneous multi-core processors because the OS-course is just to easy! Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks October 20 th 2015 Håkon Kvale
A general-purpose virtualization service for HPC on cloud computing: an application to GPUs
A general-purpose virtualization service for HPC on cloud computing: an application to GPUs R.Montella, G.Coviello, G.Giunta* G. Laccetti #, F. Isaila, J. Garcia Blas *Department of Applied Science University
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
QASM: a Q&A Social Media System Based on Social Semantics
QASM: a Q&A Social Media System Based on Social Semantics Zide Meng, Fabien Gandon, Catherine Faron-Zucker To cite this version: Zide Meng, Fabien Gandon, Catherine Faron-Zucker. QASM: a Q&A Social Media
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
Texture Cache Approximation on GPUs
Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, [email protected] 1 Our Contribution GPU Core Cache Cache
The truck scheduling problem at cross-docking terminals
The truck scheduling problem at cross-docking terminals Lotte Berghman,, Roel Leus, Pierre Lopez To cite this version: Lotte Berghman,, Roel Leus, Pierre Lopez. The truck scheduling problem at cross-docking
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
Novel Client Booking System in KLCC Twin Tower Bridge
Novel Client Booking System in KLCC Twin Tower Bridge Hossein Ameri Mahabadi, Reza Ameri To cite this version: Hossein Ameri Mahabadi, Reza Ameri. Novel Client Booking System in KLCC Twin Tower Bridge.
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
GeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
A usage coverage based approach for assessing product family design
A usage coverage based approach for assessing product family design Jiliang Wang To cite this version: Jiliang Wang. A usage coverage based approach for assessing product family design. Other. Ecole Centrale
Medical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected].
Medical Image Processing on the GPU Past, Present and Future Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected] Outline Motivation why do we need GPUs? Past - how was GPU programming
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process
VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process Amine Chellali, Frederic Jourdan, Cédric Dumas To cite this version: Amine Chellali, Frederic Jourdan, Cédric Dumas.
Clustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu [email protected] Bin Zhang [email protected] Meichun Hsu [email protected] ABSTRACT In this paper, we report our research on using GPUs to accelerate
Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.
Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development
High Performance GPGPU Computer for Embedded Systems
High Performance GPGPU Computer for Embedded Systems Author: Dan Mor, Aitech Product Manager September 2015 Contents 1. Introduction... 3 2. Existing Challenges in Modern Embedded Systems... 3 2.1. Not
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
GPU for Scientific Computing. -Ali Saleh
1 GPU for Scientific Computing -Ali Saleh Contents Introduction What is GPU GPU for Scientific Computing K-Means Clustering K-nearest Neighbours When to use GPU and when not Commercial Programming GPU
Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom [email protected] Christophe Dubach School of Informatics
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Performance Evaluation of Encryption Algorithms Key Length Size on Web Browsers
Performance Evaluation of Encryption Algorithms Key Length Size on Web Browsers Syed Zulkarnain Syed Idrus, Syed Alwee Aljunid, Salina Mohd Asi, Suhizaz Sudin To cite this version: Syed Zulkarnain Syed
Online vehicle routing and scheduling with continuous vehicle tracking
Online vehicle routing and scheduling with continuous vehicle tracking Jean Respen, Nicolas Zufferey, Jean-Yves Potvin To cite this version: Jean Respen, Nicolas Zufferey, Jean-Yves Potvin. Online vehicle
GPGPU Computing. Yong Cao
GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System
The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens
The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA
The Evolution of Computer Graphics Tony Tamasi SVP, Content & Technology, NVIDIA Graphics Make great images intricate shapes complex optical effects seamless motion Make them fast invent clever techniques
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 [email protected] THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
COSCO 2015 Heterogeneous Computing Programming
COSCO 2015 Heterogeneous Computing Programming Michael Meyer, Shunsuke Ishikuro Supporters: Kazuaki Sasamoto, Ryunosuke Murakami July 24th, 2015 Heterogeneous Computing Programming 1. Overview 2. Methodology
Learn CUDA in an Afternoon: Hands-on Practical Exercises
Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
Data Center and Cloud Computing Market Landscape and Challenges
Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs
IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs Thomas Durieux, Martin Monperrus To cite this version: Thomas Durieux, Martin Monperrus. IntroClassJava: A Benchmark of 297 Small and Buggy
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
Real-time Visual Tracker by Stream Processing
Real-time Visual Tracker by Stream Processing Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter Oscar Mateo Lozano & Kuzahiro Otsuka presented by Piotr Rudol
