
1 Spiral: Automating Library Development
Markus Püschel and the Spiral team (only part shown)
With: Srinivas Chellappa, Frédéric de Mesmay, Franz Franchetti, Daniel McFarlin, Yevgen Voronenko
Electrical and Computer Engineering, Carnegie Mellon University
This work was supported by DARPA, NSF-NGS/ITR, ACR, CPA, Intel, Mercury, National Instruments

2 Positions and Thoughts
Autotuning definition: search over a space of alternatives, and parameter-based tuning
Both are very important but fail to address some key problems; we need to think about raising the level of abstraction, which enables:
  Use of domain knowledge
  Difficult optimizations: parallelization, vectorization, etc.
  Faster porting to new platforms and platform paradigms
  Possibly automatic software development
We need coarse platform abstractions
We need more interdisciplinary collaborations
Metrics:
  Time for code development and for porting to new platforms
  Performance

3 DFT Plot: Analysis
Multiple threads: 2x
Memory hierarchy: 5x
Vector instructions: 3x
High-performance library development has become a nightmare

4 Spiral
Research goal: teach computers to write fast libraries
  Complete automation of implementation and optimization
  Including vectorization, parallelization
Functionality:
  Linear transforms (discrete Fourier transform, filters, wavelets)
  BLAS
  SAR imaging
  En/decoding (Viterbi, EBCOT in JPEG2000)
  and more
Platforms: desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid
Collaboration with Intel (Kuck, Tang, Sabanin)
  Parts of MKL/IPP generated with Spiral
  IPP 6.0: ippg domain for Spiral-generated code

5 Vision Behind Spiral
Current: numerical problem -> algorithm selection and implementation (human effort) -> C program -> compilation (automated) -> computing platform
Future: numerical problem -> algorithm selection, implementation, compilation (all automated) -> computing platform
C code is a singularity: the compiler has no access to high-level information
Challenge: conquer the high abstraction level for complete automation

6 Organization
Spiral's framework: example transforms
Complete automation achieved
Beyond transforms
Conclusions and thoughts

7 Linear Transforms
Mathematically: matrix-vector multiplication $y = T x$, with output vector $y$, transform (= matrix) $T$, and input vector $x$
Example: discrete Fourier transform (DFT)
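
To make the matrix view concrete, the following minimal Python sketch builds the n-point DFT matrix and applies it to an input vector by plain matrix-vector multiplication. The sign convention w = exp(-2*pi*j/n) and the helper names `dft_matrix` and `apply_transform` are assumptions made for illustration; the slide does not fix either.

```python
import cmath

def dft_matrix(n):
    """n x n DFT matrix with entries w**(k*l), w = exp(-2*pi*j/n) (assumed convention)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def apply_transform(T, x):
    """Compute y = T x as a plain matrix-vector multiplication."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]

x = [1, 2, 3, 4]                       # input vector
y = apply_transform(dft_matrix(4), x)  # output vector
```

Computed this way the transform costs O(n^2) operations, which is the baseline that fast algorithms improve on.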

8 Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):
$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$
Here $\mathrm{DFT}_2$ is the Fourier transform, $T^4_2$ the diagonal matrix (twiddles), $\otimes$ the Kronecker product, $I_2$ the identity, and $L^4_2$ the permutation
Algorithms are divide-and-conquer: breakdown rules
Mathematical, declarative representation: SPL (signal processing language)
SPL describes the structure of the dataflow
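
The factorization can be checked numerically. The sketch below multiplies out the four factors of the 4-point Cooley/Tukey rule and compares the product with the DFT matrix. It assumes the convention w = exp(-2*pi*j/n), twiddles w**(i*j), and a stride permutation that gathers the input at stride 2; all function names are hypothetical helpers, not Spiral or SPL syntax.

```python
import cmath

def dft(n):
    """DFT matrix with w = exp(-2*pi*j/n) (assumed sign convention)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def twiddle(m, n):
    """Diagonal matrix with entries w**(i*j), w = exp(-2*pi*j/(m*n)), 0<=i<m, 0<=j<n."""
    w = cmath.exp(-2j * cmath.pi / (m * n))
    d = [w ** (i * j) for i in range(m) for j in range(n)]
    return [[d[r] if r == c else 0 for c in range(m * n)] for r in range(m * n)]

def stride_perm(size, s):
    """Permutation matrix whose output reads the input at stride s."""
    p = [[0] * size for _ in range(size)]
    for i in range(size // s):
        for j in range(s):
            p[j * (size // s) + i][i * s + j] = 1
    return p

f2, i2 = dft(2), identity(2)
factors = [kron(f2, i2), twiddle(2, 2), kron(i2, f2), stride_perm(4, 2)]
product = factors[0]
for f in factors[1:]:
    product = matmul(product, f)

# Largest entrywise deviation between the factored form and DFT_4
reference = dft(4)
max_err = max(abs(product[r][c] - reference[r][c]) for r in range(4) for c in range(4))
```

Each factor is sparse and structured, which is what makes the factored form cheaper to apply than the dense matrix.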

9 Breakdown Rules (>200 for >50 Transforms)
Combining these rules yields many algorithms for every given transform

10 SPL to Sequential Code
Example: tensor product
Correct code: easy
Fast code: very difficult
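
As an illustration of why correct code is the easy part, here is a minimal Python sketch of the loops that the two basic tensor-product constructs correspond to. It assumes a square matrix A and uses hypothetical function names; it shows only the straightforward translation, not the optimized code a generator would emit.

```python
def apply_I_tensor_A(A, m, x):
    """y = (I_m (x) A) x: apply A to m contiguous blocks of x."""
    n = len(A)                      # A is assumed to be n x n
    y = [0] * (m * n)
    for i in range(m):              # one iteration per block
        for r in range(n):
            y[i * n + r] = sum(A[r][c] * x[i * n + c] for c in range(n))
    return y

def apply_A_tensor_I(A, m, x):
    """y = (A (x) I_m) x: apply A to m interleaved subvectors at stride m."""
    n = len(A)
    y = [0] * (m * n)
    for i in range(m):              # one iteration per interleaved subvector
        for r in range(n):
            y[r * m + i] = sum(A[r][c] * x[c * m + i] for c in range(n))
    return y

A = [[1, 1], [1, -1]]               # 2-point butterfly
x = [1, 2, 3, 4]
y_blocks = apply_I_tensor_A(A, 2, x)
y_strided = apply_A_tensor_I(A, 2, x)
```

The two variants differ only in their index arithmetic: contiguous access in the first, stride-m access in the second.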

11 Program Generation in Spiral (Sketched)
Transform (user specified)
Fast algorithm in SPL (many choices): parallelization, vectorization
Σ-SPL [PLDI 05]: loop optimizations
C code: constant folding, scheduling
Optimization at all abstraction levels
Iteration of this process to search for the fastest
But that's not all

12 SPL to Shared Memory Code: Basic Idea [SC 06]
Governing construct: tensor product $I_p \otimes A$, a block-diagonal matrix with one copy of A per processor (here Processor 0 to Processor 3)
  p-way embarrassingly parallel, load-balanced
Problematic construct: permutations produce false sharing
Task: rewrite formulas to extract tensor products and keep contiguous blocks
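
The decomposition behind the governing construct can be sketched as follows: each of p workers applies A to its own contiguous block of the input and produces its own contiguous block of the output. This is an illustrative Python sketch with hypothetical names; in CPython the threads show the data partitioning only and should not be expected to yield a speedup, and false sharing is a cache-level effect that this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor

def matvec(A, x):
    """Plain matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def parallel_I_tensor_A(A, p, x):
    """y = (I_p (x) A) x with one contiguous block of x and y per worker."""
    n = len(A[0])                                   # block length = columns of A
    blocks = [x[i * n:(i + 1) * n] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        # pool.map preserves block order, so the outputs concatenate directly
        results = list(pool.map(lambda b: matvec(A, b), blocks))
    return [v for block in results for v in block]

A = [[1, 1], [1, -1]]
y = parallel_I_tensor_A(A, 4, list(range(8)))
```

Because every worker gets a block of the same size and no two workers touch the same output entries, the work is load-balanced and free of write conflicts.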

13 Parallelization by Rewriting
Coarse platform model
Load-balanced
No false sharing

14 Same Approach for Other Parallel Paradigms
Message passing (MPI)
Vectorization
Cg/OpenGL for GPUs
Verilog for FPGAs

15 Example Results
[Plots: CPU+FPGA, CPU, CPU+GPU, Cell, GPU]

16 Summary: Complete Automation for Transforms
Platform: off-the-shelf desktop
Often: generated code faster than competition (if it exists)
Fixed input size code:
  Memory hierarchy optimization: rewriting and search for algorithm selection; rewriting for loop optimizations
  Vectorization: rewriting
  Parallelization: rewriting
General input size library:
  Derivation of library structure: rewriting, other methods

17 Generated Libraries
[Plots: RDFT, DHT, DCT2, DCT3, DCT4, DFT, Filter, Wavelet]
2-way vectorized, 2-threaded
Most are faster than hand-written libs
Recursion steps: 4-17
Code size: 8 [...] kloc or [...] MB
Generation time: 1-3 hours

18 Organization
Spiral's framework: example transforms
Complete automation achieved
Beyond transforms
  Operator language
  BLAS, Viterbi decoding, SAR imaging, EBCOT encoding
Conclusions and thoughts

19 Going Beyond Transforms
Transform = linear operator with one vector input and one vector output
Key ideas:
  Generalize to (possibly nonlinear) operators with several inputs and several outputs
  Generalize SPL (including tensor product) to OL (operator language)
  Generalize rewriting systems for parallelizations

20 Operator Language

21 Example: Matrix Multiplication (MMM)
Breakdown rules = algorithm knowledge: capture various forms of blocking
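
For readers unfamiliar with blocking, the sketch below shows one common form of it in plain Python: the product is accumulated tile by tile so that each tile of A, B, and C is reused while it is small enough to stay in fast memory. This is generic loop tiling with an assumed square block size `bs` and hypothetical function names; it is not the OL breakdown rules themselves.

```python
def mmm_naive(A, B):
    """Reference triple-loop product C = A B."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(n)]
            for i in range(m)]

def mmm_blocked(A, B, bs):
    """C = A B computed tile by tile with bs x bs blocks (edge tiles may be smaller)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, bs):              # tile rows of C
        for j0 in range(0, n, bs):          # tile columns of C
            for l0 in range(0, k, bs):      # tiles along the shared dimension
                for i in range(i0, min(i0 + bs, m)):
                    for j in range(j0, min(j0 + bs, n)):
                        acc = C[i][j]
                        for l in range(l0, min(l0 + bs, k)):
                            acc += A[i][l] * B[l][j]
                        C[i][j] = acc
    return C

A = [[i + j for j in range(5)] for i in range(3)]       # 3 x 5
B = [[i * j + 1 for j in range(4)] for i in range(5)]   # 5 x 4
C = mmm_blocked(A, B, 2)
```

Different block shapes and loop orders give different algorithms for the same product, which is the kind of choice a set of breakdown rules can enumerate.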

22 Parallelization through Rewriting
Load-balanced
No false sharing

23 Speed-up
Example: (m x k) times (k x n)
[Plot: MMM speedup over MKL, n = 32, Core 2 Duo; axes m and k]
Comparison to GotoBLAS similar

24 Viterbi Decoding in OL
Operator for Viterbi decoder
Breakdown rules

25 Results
Karn's implementation: hand-written assembly for 4 Viterbi codes
[Plot: Spiral (16-way, 8-way, 4-way, scalar) vs. Karn (16-way, 8-way, 4-way, scalar)]

26 EBCOT Coding in OL

27 Organization
Spiral's framework: example transforms
Complete automation achieved
Beyond transforms
Conclusions and thoughts

28 Raising the Abstraction Level
Formally describe and structure algorithms/applications: eternally valid
In Spiral:
  Domain-specific, declarative, mathematical language OL
  Difficult optimizations/transformations by rewriting
What it enables:
  Vectorization, parallelization using domain knowledge
  Efficient retargeting to new platforms and new platform paradigms
  Complete automation in some cases
Other examples:
  Libraries
  Identification and definition of BLAS
Parameter tuning: indispensable tool, but cannot achieve the above

29 Interdisciplinary Research Needed
Automating high-performance parallel library development draws on: programming languages, program generation, symbolic computation, rewriting, software, scientific computing, algorithms, mathematics, compilers
We need to work together
