Adaptive selection of polyhedral codes for GPU/CPU


Jean-François DOLLINGER, Vincent LOECHNER, Philippe CLAUSS
INRIA, CAMUS team, Université de Strasbourg
Saint-Hippolyte, 6 December 2011

Outline
1. Introduction
2. Pedro-CUDA framework
   - CUDA code generation
   - Profiling
   - Prediction
3. Experiments
4. Conclusion

Introduction: the Pedro framework [Pradelle et al. 2011]
- Multi-versioning of loop nests
- Target: multicore CPUs
- On-the-fly code selection
- Static code generation
- Offline profiling
- Online selection

Introduction: ranking table

Number of cores   version 1   version 2   version 3
1                 40 ms       55 ms       32 ms
2                 32 ms       28 ms       17 ms
3                 22 ms       15 ms        9 ms
4                 14 ms        7 ms        8 ms

Computing the execution time (here with the version 1 column):
cnt = {2000, 600, 300, 300}
t_pred = ((2000 - 600) × 40) + ((600 - 300) × 32) + (0 × 22) + (300 × 14) = 69800 ms
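The computation above can be sketched in plain C. This is a minimal illustration, assuming that `cnt` holds per-thread iteration counts sorted in decreasing order and that each ranking-table entry is a time per iteration for a given number of busy cores. The function name `predict_time_ms` and its signature are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Predicted execution time for one code version (hypothetical helper).
 * cnt[k]  : iteration count of the k-th busiest thread, sorted in
 *           decreasing order
 * time[k] : ranking-table entry for k+1 active cores
 * n       : number of cores (length of both arrays) */
double predict_time_ms(const long *cnt, const double *time, size_t n)
{
    double total = 0.0;
    for (size_t k = 0; k < n; ++k) {
        /* Iterations during which exactly k+1 cores are busy. */
        long next = (k + 1 < n) ? cnt[k + 1] : 0;
        assert(cnt[k] >= next); /* cnt must be sorted decreasingly */
        total += (double)(cnt[k] - next) * time[k];
    }
    return total;
}
```

Called with `cnt = {2000, 600, 300, 300}` and the version 1 column `{40, 32, 22, 14}`, the function reproduces the 69800 ms of the slide. Running it once per column gives one predicted time per version, which is what the selection step compares.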

Introduction: the idea
- Pedro-CUDA framework
  - Extension to GPU and CPU + GPU
  - Transfer costs
  - Computation times (hard to predict)
- HMPP provides optimizations
  - Tightly coupled to the target architecture
- Example of an HMPP directive:
  #pragma hmpp... cond= (N > 1000)

Introduction: workflow
[Diagram: three phases]
- Compilation phase: source code -> compiler -> CPU + GPU code versions (Version 1 ... Version M, embedded in the application binary)
- Profiling phase: profiler -> ranking table and bandwidth table
- Selection phase: execution context -> predictor -> selected version

Introduction: GPU architecture
- Memory model
  - Global memory
  - Shared memory
  - Registers
  - Constant and texture memories
- Processor model
  - Grid of thread blocks
  - GPUs execute grids
  - Streaming Multiprocessors execute blocks (SIMD)
  - Streaming Processors execute threads


Pedro-CUDA framework: CUDA code generation
- PLuTo [Bondhugula et al. 2008, Baskaran et al. 2008]
  - C-to-CUDA generator not functional
  - Invalid assertions
  - Coarse execution configuration
  - Validity of the generated code?
- Solution: adapt a CPU code
  - Generation by Python scripts
  - Kernel code
  - Host code
- Code version
  - CLooG file
  - Parallel dimensions and block sizes

Pedro-CUDA framework: CUDA code generation
- Host code
  - Outer sequential loops are copied
  - Synchronous kernel calls
  - Data transfers with cudaMemcpy()
- Device code
  - Parallel loops
    - Transformed into guards
    - Mapped onto the CUDA thread grid
    - The thread identifier is assigned to the original loop index
  - Inner loops are copied into the kernel

Pedro-CUDA framework: CUDA code generation

version:
    for (t1...)
      for (t2...)          // parallel
        for (t3...)        // parallel
          for (t4...)
            for (t5...)
              S;

host:
    cudaMemcpy(H2D)
    for (t1...) {
      cudaMemcpy(H2D)
      kernel<<<...>>>();
      cudaMemcpy(D2H)
    }

kernel:
    t2 = f(threadIdx.x);
    t3 = f(threadIdx.y);
    if (t2...)
      if (t3...)
        for (t4...)
          for (t5...)
            S;
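To make the loop-to-guard transformation concrete, here is a sequential emulation in plain C, reduced to a single parallel dimension. It is only a sketch: the thread grid is simulated with two host loops, the statement `S` is replaced by an increment, and the mapping `f` is assumed to be `lb + blockIdx * blockDim + threadIdx` (the slides do not define `f`). All function names are hypothetical.

```c
#include <assert.h>

/* Original CPU version: t2 is a parallel loop over [lb, ub). */
void run_loop(int lb, int ub, int *out)
{
    for (int t2 = lb; t2 < ub; ++t2)
        out[t2] += 1;              /* stands in for statement S */
}

/* Emulated kernel body for one thread: the parallel loop has become
 * a guard on the index derived from the thread identifier. */
static void kernel_body(int block_idx, int block_dim, int thread_idx,
                        int lb, int ub, int *out)
{
    int t2 = lb + block_idx * block_dim + thread_idx; /* assumed f() */
    if (t2 < ub)                   /* guard replaces the loop bound */
        out[t2] += 1;
}

/* Emulated launch: every (block, thread) pair runs the body once. */
void run_grid(int grid_dim, int block_dim, int lb, int ub, int *out)
{
    assert(grid_dim > 0 && block_dim > 0);
    for (int b = 0; b < grid_dim; ++b)
        for (int t = 0; t < block_dim; ++t)
            kernel_body(b, block_dim, t, lb, ub, out);
}
```

As long as `grid_dim * block_dim` covers `ub - lb`, both functions touch exactly the same elements; the guard discards the surplus threads of the last block.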


Pedro-CUDA framework: profiling
- Memory transfers between host and device
  - Measured for a selection of sizes
  - Interpolated at run time
[Figure: "Memory copies between host and device - Nvidia Quadro 4000". x axis: log10(message size (bytes)), 2 to 9; y axis: bandwidth (MB/s), 0 to 6000; curves: cudaMemcpy(cudaMemcpyHostToDevice) and cudaMemcpy(cudaMemcpyDeviceToHost).]
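One possible way to use such measurements at run time is sketched below in plain C. The slides only say that the profiled points are interpolated; linear interpolation between neighbouring points, clamping outside the measured range, and 1 MB = 10^6 bytes are all assumptions made here, as are the names `bw_sample`, `interp_bandwidth` and `transfer_time_s`.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled point: transfer size and measured bandwidth. */
struct bw_sample {
    double bytes;
    double mb_per_s;
};

/* Bandwidth for an arbitrary size, from a table sorted by increasing
 * size. Linear between points, clamped outside the measured range. */
double interp_bandwidth(const struct bw_sample *tab, size_t n, double bytes)
{
    assert(n > 0);
    if (bytes <= tab[0].bytes)
        return tab[0].mb_per_s;
    if (bytes >= tab[n - 1].bytes)
        return tab[n - 1].mb_per_s;
    size_t i = 1;
    while (tab[i].bytes < bytes)   /* first sample at or above bytes */
        ++i;
    double w = (bytes - tab[i - 1].bytes) / (tab[i].bytes - tab[i - 1].bytes);
    return tab[i - 1].mb_per_s + w * (tab[i].mb_per_s - tab[i - 1].mb_per_s);
}

/* Estimated duration of one transfer, in seconds (1 MB = 1e6 bytes). */
double transfer_time_s(const struct bw_sample *tab, size_t n, double bytes)
{
    return bytes / (interp_bandwidth(tab, n, bytes) * 1e6);
}
```

One table per direction (host to device, device to host) would be needed, since the two curves in the figure differ.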

Pedro-CUDA framework: profiling
- Offline evaluation of the code
  - Run on the target machine
  - Computation of the parameters of the parallel loops

    nok = true
    while (nok) {
      // parameter adjustment 1
      adjust_params(+1);
      foreach (parallel_dim) {
        // iteration count of this dimension (Ehrhart polynomial)
        it = ehrhart(parallel_dim)
        if (it > grid_sz(parallel_dim)) {
          nok = nok & false;   // grid size exceeded: stop growing
        } else {
          nok = nok & true;
        }
      }
    }
    adjust_params(-1);         // step back to the last valid parameters

Pedro-CUDA framework: profiling
- Computing the time per iteration

    for (;;) {
      // parameter adjustment 2
      adjust_params();
      // pointer initialization
      init_pointers();
      t_abs = gettime()
      kernel<<<grid_sz, block_sz>>>();
      t_abs = gettime() - t_abs;
      nb_iter = ehrhart();
      t_it = t_abs / nb_iter;
      if (measure_ok(t_abs)) {   // is the measurement reliable?
        return t_it;
      }
    }

Pedro-CUDA framework: profiling
- Execution time per iteration
[Figure: "Kernel launch (matmul no optimization on 2D grid)". x axis: number of blocks, 0 to 400; y axis: log10(time (ns)) per iteration, -0.5 to 2; curves: 8x2, 8x4, 8x8, 16x4 and 16x16 blocks.]

Pedro-CUDA framework: profiling
- Ranking table
  - Execution time per iteration
  - Constant interpolation beyond the threshold
  - Measurements
    - Number of blocks incremented by 1
    - Stops at an arbitrary threshold
- Bandwidth table
  - Bandwidth between host and device
  - Measurements
    - Size of the transferred data doubled at each step
    - Until the available memory is saturated
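A lookup in such a ranking table might look like the following C sketch. It assumes the table is a plain array with one entry per block count from 1 up to the profiling threshold, and it reads "constant interpolation beyond the threshold" as reusing the last measured value. The names `time_per_iteration` and `kernel_time` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* t_it[b - 1] is the profiled time per iteration for b blocks,
 * with b = 1..n and n the profiling threshold. */
double time_per_iteration(const double *t_it, size_t n, size_t blocks)
{
    assert(n > 0 && blocks > 0);
    if (blocks > n)
        blocks = n;            /* constant beyond the threshold */
    return t_it[blocks - 1];
}

/* Estimated kernel time: time per iteration times iteration count. */
double kernel_time(const double *t_it, size_t n, size_t blocks,
                   double nb_iter)
{
    return time_per_iteration(t_it, n, blocks) * nb_iter;
}
```

The iteration count `nb_iter` would come from the same Ehrhart polynomial evaluation used during profiling.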


Pedro-CUDA framework: prediction
- Prediction nests
  - Simplified loop nests
  - Approximate the total execution time and the total transfer time

    foreach (version) {
      // estimate the transfer time for the loop nest
      // (host to device)
      for (t1...) {
        // estimate the transfer time per kernel call
        // (host to device and device to host)
        // estimate the execution time of the kernel
      }
    }

Pedro-CUDA framework: prediction
- Run the version estimated to be the best
- Hopla Geiss!
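The final choice reduces to picking the version with the smallest predicted total. The C sketch below assumes that the total is simply the sum of the estimated transfer time and the estimated kernel time; the struct `version_estimate` and the function `select_version` are illustrative names, not part of the framework.

```c
#include <assert.h>
#include <stddef.h>

/* Predicted costs of one code version, in seconds. */
struct version_estimate {
    double transfer_s;   /* host/device copies, all kernel calls */
    double kernel_s;     /* kernel execution, all kernel calls */
};

/* Index of the version with the smallest predicted total time.
 * Ties are resolved in favour of the lowest index. */
size_t select_version(const struct version_estimate *v, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    double best_t = v[0].transfer_s + v[0].kernel_s;
    for (size_t i = 1; i < n; ++i) {
        double t = v[i].transfer_s + v[i].kernel_s;
        if (t < best_t) {
            best_t = t;
            best = i;
        }
    }
    return best;
}
```

The returned index would then be used to dispatch to the matching version embedded in the application binary.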


Experiments: prediction of execution times
[Figure: predicted time versus execution time, in seconds (0 to 150), for matmul1 to matmul4, gemm1 to gemm4, covariance1 to covariance3, mat-init1 and mat-init2.]

Experiments: prediction overhead
[Figure: prediction overhead, in seconds (0 to 1.6e-05), for matmul1 to matmul4, gemm1 to gemm4, covariance1 to covariance4, matinit1 and matinit2.]


Conclusion: work done
- What has been done
  - Prediction method for GPU and CPU + GPU
  - Code generation with Python scripts
  - Profiling and prediction codes
- What already works
  - Profiling and prediction

Conclusion: future work
- CUDA generator
  - Gain flexibility
  - Use codes adapted to GPUs
- CPU vs GPU integration
  - Choose a CPU version when the code performs poorly on the GPU
- Experiments and consolidation of the method
- Improved profiling


[Figure: GPU device with streaming multiprocessors 1 to N, each with its own registers, and the device-wide constant memory, texture memory and global memory.]

[Figure: 2D CUDA grid of thread blocks, Block (0, 0) to Block (1, 2), on x and y axes, annotated with gridDim.x, gridDim.y, blockDim.x, blockDim.y, blockIdx.x, blockIdx.y, threadIdx.x and threadIdx.y.]