Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms



Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms. Ayaz Ali and Prof. Lennart Johnsson, Texas Learning and Computation Center, University of Houston; Dr. Dragan Mirkovic, MD Anderson Cancer Center, University of Texas, Houston.

Outline Motivation Introduction Empirical Optimization Results Conclusion

Hardware/Software Gap The end of Moore's Law has pushed the trend toward specialized architectures. Result: it is increasingly difficult for compilers to perform optimizations. Compilers have always lagged behind hardware technology advancements.

Hardware/Software Gap Traditional ways to achieve higher performance: vendor compilers; source-to-source transformations, mainly targeting loops; hand tuning. Compilers are not good at optimizing even regular matrix-matrix multiplication (MMM).

Adaptive Auto-tuning Libraries Auto-tuning is performed in two stages. The code generator adapts to the microprocessor architecture by generating highly optimized blocks of code (micro-kernels) at installation time. The run time adapts to the memory system by scheduling the execution of those blocks. Examples: ATLAS, FFTW, UHFFT, etc.

Adaptive FFT Libraries Two-stage optimization (UHFFT and FFTW): the code generator produces micro-kernels of optimized straight-line C code blocks (radix/codelets) that each compute part of the FFT problem; the run time schedules the execution of an FFT problem by composing parameterized codelets generated by the code generator at installation time. Single-stage optimization (code-generator system): SPIRAL.

Tuning FFT Many formulas represent a single FFT (with different operation counts). The FFT's cache-oblivious schedule allows register blocking, but it can be improved for larger size codelets. Because of the recursive schedule, loop-level optimizations are not as effective as in BLAS. Data access in the FFT is strided (non-sequential).

FFT Formula Many formulas can represent an FFT of a given size (e.g. size 12). A lower operation count does not always lead to higher performance. [Chart: performance of different size-12 FFT formulas]

Instruction Scheduling Two versions of the size-4 codelet; reordering alone gives up to 7% improvement.

Version 1:
tmp0i = Im(x[0])+Im(x[2*is]); tmp0r = Re(x[0])+Re(x[2*is]);
tmp1i = Im(x[0])-Im(x[2*is]); tmp1r = Re(x[0])-Re(x[2*is]);
tmp2i = Im(x[is])+Im(x[3*is]); tmp2r = Re(x[is])+Re(x[3*is]);
tmp3i = Im(x[is])-Im(x[3*is]); tmp3r = Re(x[is])-Re(x[3*is]);
Re(y[0]) = tmp0r+tmp2r; Im(y[0]) = tmp0i+tmp2i;
Re(y[os]) = tmp1r+tmp3i; Im(y[os]) = tmp1i-tmp3r;
Re(y[2*os]) = tmp0r-tmp2r; Im(y[2*os]) = tmp0i-tmp2i;
Re(y[3*os]) = tmp1r-tmp3i; Im(y[3*os]) = tmp1i+tmp3r;

Version 2 (same computation, reordered):
tmp3r = Re(x[is])-Re(x[3*is]); tmp2r = Re(x[is])+Re(x[3*is]);
tmp3i = Im(x[is])-Im(x[3*is]); tmp2i = Im(x[is])+Im(x[3*is]);
tmp1r = Re(x[0])-Re(x[2*is]); tmp0r = Re(x[0])+Re(x[2*is]);
tmp1i = Im(x[0])-Im(x[2*is]); tmp0i = Im(x[0])+Im(x[2*is]);
Re(y[2*os]) = tmp0r-tmp2r; Re(y[0]) = tmp0r+tmp2r;
Im(y[os]) = tmp1i-tmp3r; Im(y[3*os]) = tmp1i+tmp3r;
Re(y[3*os]) = tmp1r-tmp3i; Re(y[os]) = tmp1r+tmp3i;
Im(y[2*os]) = tmp0i-tmp2i; Im(y[0]) = tmp0i+tmp2i;

Platform             Compiler    MFLOPS (Ver 1)   MFLOPS (Ver 2)
Itanium2 (1.5 GHz)   gcc         1815             1878 (3-4%)
Itanium2             icc 8       1932             2065 (7%)
Itanium2             icc 9       1932             1997 (3-4%)
Opteron (2.0 GHz)    pathscale   2449             2597 (5-6%)
Opteron              gcc         2372             2500 (5-6%)

Code Generation (UHFFT1.6 vs FFTW3)
FFT Formula: minimization (FFTW) vs. heuristics based (UHFFT1.6)
Schedule: cache oblivious followed by DAG reordering (FFTW) vs. cache-oblivious schedule of instructions (UHFFT1.6)
Register Blocking: yes (FFTW) vs. no (UHFFT1.6)
Architecture (ISA) Specific Codelets: yes, e.g. fma, sse (FFTW) vs. no, relies on the compiler completely (UHFFT1.6)
Codelet Types: straight-line and single loop with vector recursion (FFTW) vs. straight-line only (UHFFT1.6)
Generator: written in Caml, a rule-based language (FFTW) vs. written in C, which allows more flexibility (UHFFT1.6)
Further Optimizations: an expert user selects the best after some trials (both)

Empirical Optimization Iterative empirical optimization through compiler feedback. Auto-tune the non-deterministic elements in the performance of codes. Adapt to the architecture as well as the compiler. Recently, research has focused on iterative compiler optimization of whole applications. Successfully used in ATLAS for BLAS codes. Costly but manageable (especially in micro-kernels).

Empirical Optimization Basic ingredient, the iterative compilation loop: generate variants (varying parameters); compile them dynamically; evaluate performance at run time; use the measurements to guide the tuning process.

UHFFT Empirical Auto-tuning Code Generator

FFDL Grammar The FFT Formula Description Language (FFDL) is used to specify the FFT formula.

Codelet Types A compact string representation identifies each codelet type.

Codelet Types Twiddle codelets: perform the FFT and multiply by twiddle factors. Vector codelets: compute multiple FFTs on 2D blocks by pushing the loop inside the codelet:

CodeletN( ) {
  for (i = 1 to r)
    compute fftn
}

Rotated codelets: for the PFA algorithm (fewer multiplications; no twiddle multiplications).

Level 1 Optimizations Rader implementation modes:
CIRC: circulant permutation
SKEW: skew circulant
SKEWP: skew circulant with partial diagonalization

Level 1 Optimizations Real FFT

Level 2 Optimizations Register blocking: temporaries are declared inside nested scope blocks to limit their live ranges.

Unblocked straight-line code:
Re(y[os]) = Re(x[0])-Re(x[is]);
Im(y[0]) = Im(x[0])+Im(x[is]);
Re(y[0]) = Re(x[0])+Re(x[is]);
Im(y[os]) = Im(x[0])-Im(x[is]);

Blocked version:
{ // block 4
  int reg0, reg1, reg2, reg3;
  { // block 2
    int reg4, reg5;
  }
  Im(y[0]) = Im(x[0])+Im(x[is]);
  Re(y[0]) = Re(x[0])+Re(x[is]);
  Im(y[os]) = Im(x[0])-Im(x[is]);
  Re(y[os]) = Re(x[0])-Re(x[is]);
}
{ // block 2
  int reg6, reg7;
}

Level 2 Optimizations Real FFT

Level 3 Optimizations Scalar replacement.

Before:
Y[0] = X[0]+X[1];
Y[1] = X[0]-X[1];

After:
reg0 = X[0];
reg1 = X[1];
reg2 = reg0 + reg1;
reg3 = reg0 - reg1;
Y[0] = reg2;
Y[1] = reg3;

Level 3 Optimizations Real FFT

Results Experimental setup: two architectures (Itanium2 and Opteron); two compilers on each architecture (icc/gcc and pathcc/gcc). Performance metric: MFlop/s. Codelet strides are varied to mimic actual execution inside the FFT schedule (larger strides => lower performance).

Codelets (Itanium)

Codelets (Opteron)

Codelet Size vs. Performance Itanium has more registers; Opteron has a larger instruction cache.

Performance Improvement (UHFFT2) FFT (Powers of 2 Size) on Itanium2 using ICC Major Gain due to Vector Codelets

Performance Improvement (UHFFT2) FFT (Non-Powers of 2) on Itanium2 using ICC Major Gain due to Special PFA Codelets

Performance Improvement (UHFFT2) FFT (Powers of 2 Size) on Itanium2 using GCC Major Gain due to Scalar Replacement

Performance Improvement (UHFFT2) FFT (Non-Powers of 2) on Itanium2 using GCC Major Gain due to Special PFA Codelets

Conclusion Mixed results; the technique generates codelets at least as good as any heuristic. The better the compiler, the smaller the impact. Performance portability across compilers. Major performance improvements for larger codelets, but large codelets are rarely used because of their lower performance. Simple heuristics could be used in some cases instead of expensive empirical auto-tuning, especially for power-of-two codelets.

Thank You!

Baseline Performance UHFFT Performance (Prior to Empirical Tuning)

Baseline Performance Special codelets (e.g. fma) in FFTW and MKL; UHFFT is slightly better at managing memory.

Code Generation (FFTW3 vs UHFFT2)
FFT Formula: minimization using heuristics (FFTW) vs. trying a limited number of heuristics-based formulas plus any user-supplied formulas (UHFFT)
Schedule: cache oblivious followed by DAG reordering (FFTW) vs. cache-oblivious schedule followed by DAG reordering (UHFFT)
Register Blocking: yes (FFTW) vs. multiple blockings empirically evaluated (UHFFT)
Architecture (ISA) Specific Codelets: yes, e.g. fma, sse (FFTW) vs. no, relies on the compiler completely (UHFFT)
Codelet Types: straight-line and looped FFT, twiddle, and in-place (transposed) codelets, no PFA (FFTW) vs. straight-line and looped FFT, twiddle, in-place, and rotated PFA codelets (UHFFT)
Generator: written in Caml, a rule-based language (FFTW) vs. written in C, which allows more flexibility (UHFFT)
Adaptation: adapts to the architecture (FFTW) vs. adapts to both architecture and compiler (UHFFT)
Optimizations: based on heuristics, with trial and error for more options (FFTW) vs. empirically tuned automatic code generator (UHFFT)

Our Solution Generate more, and more specialized, micro-kernels. Consider more FFT formulas. Use register blocking to avoid register pressure. Use iterative empirical optimization through compiler feedback.