Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems




Auto-Tuning TRSM with an Asynchronous Task Assignment Model on Multicore, GPU and Coprocessor Systems Murilo Boratto Núcleo de Arquitetura de Computadores e Sistemas Operacionais, Universidade do Estado da Bahia, Brazil July 2, 2016

Outline: Motivation, TRSM, Auto-Tuning, Modelling, Experimental Results, Perspectives

Objective
Auto-tuning: automatically decide how to run a routine to achieve low execution time.
Predict the routine's behaviour depending on the problem size and the computational system.
Use parameterized routines and models, with theoretical or empirical estimation of the system parameters, and select the routine parameters at execution time.

Hybrid parallelism
Routines combining different sources of parallelism:
Multilevel parallelism with OpenMP
OpenMP+BLAS+MAGMA parallelism
CPU+GPU+Coprocessor parallelism
Possible extensions: MPI+OpenMP+BLAS+MAGMA+Multi-GPU+Multi-Coprocessor...
Hybrid, heterogeneous, hierarchical systems and computation.

Applications
Collaboration with the Scientific Computing and Parallel Programming (SCPP) group at the University of Murcia: auto-tuning of parallel routines and applications of parallelism (http://dis.um.es/~domingo/investigacion.html).
In this presentation we summarize our ongoing work on auto-tuning a hybrid-parallel linear algebra routine: the Triangular Linear Systems Solver (TRSM).

Applications: Linear algebra
Basic routines (matrix multiplication, factorizations...) with OpenMP+BLAS+MAGMA parallelism, to be used in large computational problems (electromagnetism, statistical models...).
Related work:
J. Cuenca, L. P. García, D. Giménez, F. J. Herrera: Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU+multiGPUs Platforms. CMMSE16.
M. Boratto, P. Alonso, D. Giménez, A. Lastovetsky: Automatic Tuning to Performance Modelling of Matrix Polynomials on Multicore and Multi-GPU Systems. CMMSE15; The Journal of Supercomputing.

Computational systems: CPU+GPU+Coprocessor
Combination of multicores + GPU + coprocessor.
Preliminary analysis of the combination of the three types of systems.
Modelling or empirical installation?
Goal: an auto-tuning methodology for nested parallelism (OpenMP+BLAS+MAGMA), with experiments on the TRSM routine.

TRSM

TRSM
The TRSM routine solves the linear equation AX = B, where A is an upper or lower triangular matrix and B is a known matrix, called the right-hand-side matrix.
Parallel implementation with a block algorithm: small triangular systems are solved with TRSM on the diagonal blocks, and the trailing right-hand-side blocks are updated with matrix multiplications (GEMM).
The blocks are distributed to the computational components through a structure of tasks, managed by a producer thread and consumer threads (a consumer thread waits when no task is ready).
Dependency states: -1 = task done, 0 = task waiting to be done, 1 = task with a single dependency, 2 = task with a double dependency.
[Figure: 4x4 block grid with TRSM tasks on the diagonal blocks (00, 11, 22, 33) and GEMM tasks on the off-diagonal blocks (10, 20, 30, 21, 31, 32); producer thread feeding GEMM tasks on blocks 10, 20, 30 to consumer threads T1, T2, T3.]
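As a minimal sketch of this block algorithm, assuming a NumPy/SciPy environment (the function name, the block-size argument w, and the sequential task order are ours; the presented implementation instead dispatches each TRSM/GEMM task to CPU, GPU, or coprocessor consumers):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_trsm_lower(A, B, w):
        """Solve A X = B, with A lower triangular, by blocks of size w:
        a small triangular solve (TRSM) on each diagonal block, then a
        matrix multiplication (GEMM) updating the trailing block rows."""
        n = A.shape[0]
        X = B.copy()
        for i in range(0, n, w):
            ie = min(i + w, n)
            # TRSM task: solve A_ii X_i = (current right-hand-side block)
            X[i:ie] = solve_triangular(A[i:ie, i:ie], X[i:ie], lower=True)
            if ie < n:
                # GEMM tasks: remove the contribution of X_i from the
                # block rows below; each block row is an independent task
                X[ie:] -= A[ie:, i:ie] @ X[i:ie]
        return X

    # Quick check on a well-conditioned lower triangular system
    n = 512
    A = np.tril(np.random.rand(n, n)) + n * np.eye(n)
    B = np.random.rand(n, 16)
    assert np.allclose(A @ blocked_trsm_lower(A, B, w=64), B)

In the task scheme above, each diagonal TRSM depends on the GEMM updates to its block row, which is what the dependency counters track.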

Auto-Tuning

Auto-Tuning
Auto-tuning is a software technique by which a routine is expected to fulfill its performance requirements at runtime, in response to changes in the system (different system configurations) or in the problem (size and particular input). The principal objectives are:
Self-configuration
Self-optimization

Auto-Tuning Methodology

Empirical Installation
Run selected executions at installation time:
Experiment with the parameters of the model.
A large installation time may be needed.
For some problem sizes, search for the best parameter combination: exhaustive search, or guided search (search in the most promising directions).
Experiments are run with some problem sizes (the installation set) and with values of the algorithmic parameters (the parameters set).
The equation that predicts the execution time is generated, or the preferred parameters are stored. A sketch of this search loop follows.
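A hedged sketch of the exhaustive installation-time search (it reuses blocked_trsm_lower from the TRSM slide; using threadpoolctl to cap the BLAS thread count at c, and the fixed number of right-hand-side columns, are our assumptions):

    import time
    from itertools import product
    import numpy as np
    from threadpoolctl import threadpool_limits

    def install(sizes, cs, ws, rhs_cols=64):
        """Time every (c, w) pair on every installation-set size and
        record the fastest combination per size."""
        best = {}
        for n in sizes:
            A = np.tril(np.random.rand(n, n)) + n * np.eye(n)
            B = np.random.rand(n, rhs_cols)
            timings = {}
            for c, w in product(cs, ws):
                with threadpool_limits(limits=c):  # cap BLAS threads at c
                    t0 = time.perf_counter()
                    blocked_trsm_lower(A, B, w)
                    timings[(c, w)] = time.perf_counter() - t0
            best[n] = min(timings, key=timings.get)
        return best

A guided search would instead explore only (c, w) neighbours of the best pair found for the previous problem size, trading installation time for a possibly suboptimal choice.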

Experimental Results

Table: Execution Environment: Hardware and Software

    Processor              Intel(R) Core(TM)
    Memory                 12GB
    Clock                  3.60GHz
    Number of Processors   2
    Cores per Processor    8
    GPU Type               NVIDIA TESLA C2070
    Number of GPUs         2
    CUDA Cores per GPU     2496
    GPU Memory per GPU     5GB
    CUDA Version           4.0
    MIC                    Intel Xeon Phi
    Number of MICs         1
    Cores per MIC          57

Installation
Algorithmic parameters:
c: number of threads
w: block size
Installation set = {100, 200, 300, 400, 500}
Parameters set: c in {2, 4, 6, 8, 10, 12, 14, 16}, w in {16, 32, 64}
The best parameter combination is obtained, (12, 64), and the equation that predicts the execution time is generated.
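The form of the generated equation is not shown in the slides; as one hedged possibility, a low-degree polynomial in n fitted by least squares to the installed timings (the function name and the cubic model form are our assumptions):

    import numpy as np

    def fit_time_model(sizes, times):
        """Least-squares fit of t(n) ~ a*n**3 + b*n**2 + c*n to the
        installed timings; the cubic term leads because TRSM with an
        n-column right-hand side performs O(n**3) operations."""
        n = np.asarray(sizes, dtype=float)
        V = np.column_stack([n**3, n**2, n])  # design matrix
        coeffs, *_ = np.linalg.lstsq(V, np.asarray(times, dtype=float),
                                     rcond=None)
        return coeffs  # (a, b, c)

At execution time the routine would evaluate the fitted equation for the requested problem size (per stored (c, w) pair) and select the combination with the lowest predicted time.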

Execution Time
Execution time with different parameter values (in seconds).

    Platform    n = 10240      n = 11200      n = 12800      n = 14400
       c        w    time      w    time      w    time      w    time
       1       16   97.86     16  125.64     16  193.59     32  270.78
       2       16   52.90     16   70.06     32  105.80     32  173.19
       4       16   28.79     32   39.23     32   59.22     64   85.46
       6       32   17.46     32   22.87     32   33.60     64   48.42
       8       32   14.25     32   19.09     32   29.71     64   41.99
      10       64   13.24     64   17.85     64   27.87     64   38.93
      12       64   12.28     64   14.74     64   23.24     64   32.16
      14       64   13.76     64   16.50     64   26.45     64   35.65
      16       64   15.04     64   17.73     64   28.04     64   33.86

Perspectives

Perspectives
Modelling can help in the auto-tuning of basic parallel routines and scientific codes, thus contributing to the efficient use of parallel programs.
Hybrid parallelism (different types of parallelism, different paradigms...) introduces additional difficulties; sometimes the theoretical models are combined with empirical analysis.
Satisfactory results with the TRSM routine, but better modelling techniques are needed, especially for complex scientific problems on more complex (hybrid, heterogeneous, hierarchical) systems.