Spiral: Automating Library Development
Markus Püschel and the Spiral team (only part shown)
With: Srinivas Chellappa, Frédéric de Mesmay, Franz Franchetti, Daniel McFarlin, Yevgen Voronenko
Electrical and Computer Engineering, Carnegie Mellon University
This work was supported by DARPA, NSF (NGS/ITR, ACR, CPA), Intel, Mercury, and National Instruments.
Positions and Thoughts
- Autotuning (search over a space of alternatives) and parameter-based tuning are very important but fail to address some key problems; we need to think about raising the level of abstraction, which enables:
  - Use of domain knowledge
  - Difficult optimizations: parallelization, vectorization, etc.
  - Faster porting to new platforms and platform paradigms
  - Possibly automatic software development
- We need coarse platform abstractions
- We need more interdisciplinary collaborations
- Metrics:
  - Time for code development and for porting to new platforms
  - Performance
DFT Performance Plot: Analysis
- Multiple threads: 2x
- Memory hierarchy: 5x
- Vector instructions: 3x
High-performance library development has become a nightmare.
Spiral
Research goal: teach computers to write fast libraries
- Complete automation of implementation and optimization, including vectorization and parallelization
Functionality:
- Linear transforms (discrete Fourier transform, filters, wavelets)
- BLAS
- SAR imaging
- En/decoding (Viterbi, EBCOT in JPEG 2000)
- more
Platforms: desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid
Collaboration with Intel (Kuck, Tang, Sabanin):
- Parts of MKL/IPP generated with Spiral
- IPP 6.0: ippg domain for Spiral-generated code
Vision Behind Spiral
- Current: numerical problem → algorithm selection and implementation (human effort) → C program → compilation (automated) → computing platform. The C code is a singularity: the compiler has no access to the high-level information.
- Future: numerical problem → algorithm selection, implementation, and compilation all automated → computing platform.
Challenge: conquer the high abstraction level for complete automation.
Organization
- Spiral's framework: example transforms
- Complete automation achieved
- Beyond transforms
- Conclusions and thoughts
Linear Transforms
Mathematically: matrix-vector multiplication
  output vector = transform matrix × input vector
Example: the discrete Fourier transform (DFT), DFT_n = [w_n^{kl}] for 0 <= k, l < n, with w_n = e^{-2 pi i / n}
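As a concrete sketch of "transform = matrix-vector multiplication", the following defines the DFT matrix from the formula above and applies it by the direct O(n^2) product. The names `dft_matrix` and `apply_transform` are illustrative, not Spiral's API:

```python
import cmath

def dft_matrix(n):
    # DFT_n[k][l] = w_n^(k*l), with w_n = exp(-2*pi*i/n)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def apply_transform(matrix, x):
    # plain matrix-vector product: the "slow" definition a fast algorithm must match
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]
```

For example, `apply_transform(dft_matrix(4), [1, 0, 0, 0])` returns a vector whose entries are all approximately 1, as expected for an impulse input.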
Transform Algorithms: Example 4-Point FFT
Cooley-Tukey fast Fourier transform (FFT):

  DFT_4 = (DFT_2 ⊗ I_2) · T^4_2 · (I_2 ⊗ DFT_2) · L^4_2

where DFT_2 is the 2-point Fourier transform [[1, 1], [1, -1]], T^4_2 = diag(1, 1, 1, -j) is the diagonal matrix of twiddle factors, ⊗ is the Kronecker (tensor) product, I_2 the identity, and L^4_2 the stride permutation.
- Algorithms are divide-and-conquer: breakdown rules
- Mathematical, declarative representation: SPL (signal processing language)
- SPL describes the structure of the dataflow
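The radix-2 Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2 can be checked numerically; a minimal sketch with illustrative helper names, where `kron` is the Kronecker product and `L` is the stride permutation gathering even-indexed then odd-indexed entries:

```python
import cmath

def dft(n):
    # DFT_n[k][l] = w^(k*l), w = exp(-2*pi*i/n)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def kron(A, B):
    # Kronecker (tensor) product: (A (x) B)[i*p+r][j*q+c] = A[i][j]*B[r][c]
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

I2 = [[1, 0], [0, 1]]
# twiddle matrix T^4_2 = diag(1, 1, 1, -j)
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1j]]
# stride permutation L^4_2: [x0, x1, x2, x3] -> [x0, x2, x1, x3]
L = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# right-hand side of the breakdown rule
rhs = matmul(matmul(kron(dft(2), I2), T), matmul(kron(I2, dft(2)), L))
```

Comparing `rhs` entrywise with `dft(4)` shows the two sides agree up to floating-point roundoff.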
Breakdown Rules (>200 for >50 Transforms)
Combining these rules yields many algorithms for every given transform.
SPL to Sequential Code
Example: tensor product
- Correct code: easy
- Fast code: very difficult
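The slide's point can be made concrete: direct loop code for the two basic tensor-product constructs is short and obviously correct, while a fast version (vectorized, cache-aware) is much harder. A sketch with illustrative names; note the strided access pattern of (A ⊗ I_m), which is exactly what makes fast code hard:

```python
def apply_I_kron_A(A, x, m):
    # y = (I_m (x) A) x: apply A to m contiguous blocks of x
    n = len(A)
    y = []
    for i in range(m):
        blk = x[i * n:(i + 1) * n]
        y.extend(sum(A[r][c] * blk[c] for c in range(n)) for r in range(n))
    return y

def apply_A_kron_I(A, x, m):
    # y = (A (x) I_m) x: apply A to m interleaved subvectors at stride m
    n = len(A)
    y = [0] * (n * m)
    for j in range(m):
        sub = x[j::m]                      # strided access: poor locality
        for r in range(n):
            y[r * m + j] = sum(A[r][c] * sub[c] for c in range(n))
    return y
```

With A = DFT_2 = [[1, 1], [1, -1]], m = 2, and x = [1, 2, 3, 4], the contiguous variant yields [3, -1, 7, -1] and the strided variant [4, 6, -2, -2].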
Program Generation in Spiral (Sketched)
- Transform (user specified)
- Fast algorithm in SPL (many choices)
- Σ-SPL [PLDI 05]: parallelization, vectorization, loop optimizations
- C code: constant folding, scheduling
Optimization happens at all abstraction levels; the process is iterated to search for the fastest result. But that's not all.
SPL to Shared Memory Code: Basic Idea [SC 06]
- Governing construct: the tensor product I_p ⊗ A is p-way embarrassingly parallel and load-balanced; processor i applies A to block i of the input
- Problematic construct: permutations produce false sharing
- Task: rewrite formulas to extract tensor products and keep contiguous blocks
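The "embarrassingly parallel" structure of I_p ⊗ A can be sketched as one task per block; this is a toy model only (Spiral emits threaded C code, and Python threads would not actually speed up this numeric loop), included to show why each task touches only its own contiguous slice, so there is no false sharing:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_I_kron_A(A, x, p):
    # y = (I_p (x) A) x: the p blocks are independent, so one task per block,
    # each reading and writing only its own contiguous range
    n = len(A)

    def work(i):
        blk = x[i * n:(i + 1) * n]
        return [sum(A[r][c] * blk[c] for c in range(n)) for r in range(n)]

    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = pool.map(work, range(p))
    return [v for part in parts for v in part]
```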
Parallelization by Rewriting
- Coarse platform model
- Load-balanced
- No false sharing
Same Approach for Other Parallel Paradigms
- Message passing: MPI
- Vectorization
- Cg/OpenGL for GPUs
- Verilog for FPGAs
Example Results
Platforms shown: CPU, CPU+FPGA, CPU+GPU, GPU, Cell
Summary: Complete Automation for Transforms
Platform: off-the-shelf desktop. Often the generated code is faster than the competition (where it exists).
- Memory hierarchy optimization: rewriting and search for algorithm selection; rewriting for loop optimizations
- Vectorization: rewriting
- Parallelization: rewriting
- Derivation of library structure (from fixed-input-size code to a general-input-size library): rewriting and other methods
Generated Libraries
Transforms: DFT, RDFT, DHT, DCT2, DCT3, DCT4, filter, wavelet
- 2-way vectorized, 2-threaded
- Most are faster than hand-written libraries
- Recursion steps: 4-17
- Code size: 8-120 kloc, or 0.5-5 MB
- Generation time: 1-3 hours
Organization
- Spiral's framework: example transforms
- Complete automation achieved
- Beyond transforms: operator language; BLAS, Viterbi decoding, SAR imaging, EBCOT encoding
- Conclusions and thoughts
Going Beyond Transforms
Transform = linear operator with one vector input and one vector output
Key ideas:
- Generalize to (possibly nonlinear) operators with several inputs and several outputs
- Generalize SPL (including the tensor product) to OL (operator language)
- Generalize the rewriting systems for parallelization
Operator Language
Example: Matrix Multiplication (MMM)
Breakdown rules = algorithm knowledge: they capture various forms of blocking.
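One form of blocking that such breakdown rules capture is classic loop tiling; a minimal sketch (illustrative only, not Spiral's generated code), where each (i0, j0, k0) iteration multiplies one pair of nb x nb blocks so that the working set fits in cache:

```python
def mmm_blocked(A, B, nb):
    # C = A * B with three-level blocking by factor nb
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, m, nb):
            for k0 in range(0, k, nb):
                # multiply the block pair A[i0:i0+nb, k0:k0+nb] * B[k0:k0+nb, j0:j0+nb]
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, m)):
                        s = C[i][j]
                        for kk in range(k0, min(k0 + nb, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] = s
    return C
```

Different blockings (choice of nb, loop order, recursive vs. iterative tiling) are exactly the degrees of freedom a breakdown-rule search explores.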
Parallelization Through Rewriting
- Load-balanced
- No false sharing
Speedup Results for MMM
MMM speedup over MKL for (m x k) times (k x n), with n = 32, on a Core 2 Duo: mostly 1.0-1.5x, in places 1.5-2.0x, over the range m, k = 0 to 35. The comparison to GotoBLAS is similar.
Viterbi Decoding in OL
(Convolutional code illustration: http://www.ece.unb.ca/tervo/ee4253/convolution3.htm)
- Operator for the Viterbi decoder
- Breakdown rules
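To make the decoder concrete, here is a minimal hard-decision Viterbi decoder for a standard rate-1/2, constraint-length-3 convolutional code (generators 7 and 5 octal). This is a textbook sketch, not one of the specific codes or the OL formulation used in Spiral:

```python
def parity(x):
    return bin(x).count("1") & 1

G = (0b111, 0b101)         # generator polynomials (7, 5 octal)
K = 3                      # constraint length
NSTATES = 1 << (K - 1)     # 4 encoder states

def conv_encode(bits):
    # shift each input bit into the register; emit one output bit per polynomial
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1
    return out

def viterbi_decode(received, nbits):
    # hard-decision Viterbi: keep the best-metric (Hamming-distance) path per state
    INF = float("inf")
    metrics = [0.0] + [INF] * (NSTATES - 1)   # encoder starts in state 0
    paths = [[] for _ in range(NSTATES)]
    for t in range(nbits):
        r = received[2 * t: 2 * t + 2]
        new_metrics = [INF] * NSTATES
        new_paths = [[] for _ in range(NSTATES)]
        for s in range(NSTATES):
            for b in (0, 1):
                reg = (b << (K - 1)) | s
                ns = reg >> 1
                branch = sum(ri != parity(reg & g) for ri, g in zip(r, G))
                m = metrics[s] + branch
                if m < new_metrics[ns]:
                    new_metrics[ns] = m
                    new_paths[ns] = paths[s] + [b]
        metrics, paths = new_metrics, new_paths
    return paths[min(range(NSTATES), key=metrics.__getitem__)]
```

The inner double loop over (state, input bit) is the butterfly-like structure that OL breakdown rules expose for vectorization.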
Results
Karn's implementation: hand-written assembly for 4 Viterbi codes. Comparison of Spiral-generated code against Karn's at 16-way, 8-way, and 4-way vectorization and scalar.
EBCOT Coding in OL
Organization
- Spiral's framework: example transforms
- Complete automation achieved
- Beyond transforms
- Conclusions and thoughts
Raising the Abstraction Level
- Formally describe and structure algorithms/applications; this knowledge is eternally valid
- In Spiral: the domain-specific, declarative, mathematical language OL; difficult optimizations/transformations done by rewriting
- What it enables: vectorization and parallelization using domain knowledge; efficient retargeting to new platforms and new platform paradigms; complete automation in some cases
- Other examples: libraries, such as the identification and definition of BLAS
- Parameter tuning: an indispensable tool, but it cannot achieve the above
Interdisciplinary Research Needed
Automating high-performance parallel library development draws on programming languages, program generation, symbolic computation, rewriting, software, scientific computing, algorithms, mathematics, and compilers. We need to work together.