Optimizing Code for Accelerators: The Long Road to High Performance
Hans Vandierendonck
Mons GPU Day, November 9th, 2010
The Age of Accelerators
Accelerators in Real Life
Why Accelerators?
- Exploit the available transistor budget
- Increased performance (30:1, 1,000:1) and energy efficiency (30,000:1 performance/watt) (Dally et al., The Classical Computer, ISAT study, 2001)
- Sacrifice generality (application-specific)
(Chart: latency in ps/inst)
GPP vs. Accelerators

                         AMD Opteron 2380   NVIDIA Tesla C1060   ATI FirePro V8700   STI Cell QS21 (8 SPEs)
Core frequency           2.7 GHz            1.3 GHz              750 MHz             3.2 GHz
Processor cores          8                  240                  160                 8
Data-level parallelism   4-way SIMD         8-way SIMT           5-way VLIW          4-way SIMD
Memory bandwidth         24 GB/s            102.4 GB/s           108.8 GB/s          25.6 GB/s (in), 35 GB/s (out)
Peak SP perf. (GFLOPS)   160                933                  816                 204.8
Peak DP perf. (GFLOPS)   80                 77.76                240                 14.63
TDP                      75 W               187 W                114 W               45 W
Feature size             45 nm              55 nm                55 nm               90 nm
Massively Parallel Architectures: e.g., the NVIDIA GPU architecture
Position of This Talk
- Let's say we want to do this. What performance can we obtain? How do we optimize the code? How much effort does it take? What happens to the code? Will we regret this?
- Method: implement algorithms to study performance portability, and optimize an application for the Cell processor
Overview
- Introduction
- Performance Portability
- ClustalW and Cell BE
- Single-SPU Optimizations
- Parallelization
- Analysis
Overview: Performance Portability
Method: Compare Code Optimizations Across Processors
- Optimizations studied: loop unrolling (duplicating the loop body) and vectorization (a single vector instruction operating on elements A, B, C, D at once, instead of a scalar instruction such as ai R1, R2, 1); a sketch of the variants follows below
- Processors: CPU (Core i7), NVIDIA Tesla C1060, ATI FirePro V8700, Cell QS21
- Evaluate the code optimizations on several accelerators
- OpenCL provides functional portability
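To make the two optimizations concrete, here is an illustrative sketch of the same loop in its baseline, unrolled, and vectorized forms. It is not the benchmark code itself (the study uses OpenCL kernels such as cp and mri-fhd); the vectorized variant is shown with the SPU vector types used later in this deck, and the function names, alignment, and divisibility assumptions are mine.

    #include <spu_intrinsics.h>   /* for the vectorized variant (SPU only) */

    /* baseline scalar loop */
    void saxpy_scalar(float *out, const float *a, const float *b, float c, int n) {
        for (int i = 0; i < n; i++)
            out[i] = a[i] * b[i] + c;
    }

    /* unrolled by 4 (assumes n is a multiple of 4; remainder loop omitted) */
    void saxpy_unroll4(float *out, const float *a, const float *b, float c, int n) {
        for (int i = 0; i < n; i += 4) {
            out[i]     = a[i]     * b[i]     + c;
            out[i + 1] = a[i + 1] * b[i + 1] + c;
            out[i + 2] = a[i + 2] * b[i + 2] + c;
            out[i + 3] = a[i + 3] * b[i + 3] + c;
        }
    }

    /* vectorized: one 4-wide multiply-add per iteration
       (assumes 16-byte-aligned arrays and n a multiple of 4) */
    void saxpy_vector(float *out, const float *a, const float *b, float c, int n) {
        vector float vc = spu_splats(c);
        for (int i = 0; i < n; i += 4)
            *(vector float *)&out[i] =
                spu_madd(*(vector float *)&a[i], *(vector float *)&b[i], vc);
    }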
Performance Impact of Loop Unrolling and Vectorization
(Chart: relative execution time for unroll factors 1, 2, 4, 8, 16, 32, in scalar and vector variants, on the CPU, Tesla, FirePro, and Cell; benchmark: cp; block size 1 on the Cell, 128 on the others)
- The CPU benefits from vectorization and is indifferent to loop unrolling
- Vectorization is critical to the Cell, and the Cell also benefits from loop unrolling
- The optimizations interact with each other; the FirePro is more sensitive, and too much unrolling degrades its performance
- The same experiment on benchmark mri-fhd shows the same picture for the Cell: vectorization is critical and loop unrolling helps
Performance Impact of Thread Block Size (Parallelism)
(Chart: execution time in seconds for block sizes 1, 4, 16, 128, 256, 512 on the CPU, Tesla, FirePro, and Cell; benchmark: mri-fhd; loop unrolling 2; vectorization enabled)
- Thread block size is the most important parameter for Tesla
What have we learned from this?
- Performance portability: no free lunch
- Optimizations are architecture specific
- Sensitivity is program specific
- Optimizations interact with each other
- Potentially large speedups
Overview: ClustalW and Cell BE
The Cell processor: a high-level overview
- Power Processing Element (PPE): 64-bit PowerPC RISC core; 32 KB IL1, 32 KB DL1, 512 KB L2; in-order instruction issue; 2-way superscalar
- Synergistic Processing Element (SPE): 128-bit in-order RISC-like vector processor; 256 KB local store with explicit memory management (see the DMA sketch below); SIMD; no hardware branch predictor
- System memory interface: 16 B/cycle, 25.6 GB/s (at 1.6 GHz)
- Element Interconnect Bus (EIB): 204.8 GB/s peak bandwidth; 4 data rings of 16 bytes
- A heterogeneous multi-core
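As a concrete illustration of the explicit memory management on the SPE, here is a minimal sketch of pulling a block of main memory into the local store with a DMA and waiting for it to complete. The buffer size, tag number, and function name are illustrative, not taken from the talk.

    #include <spu_mfcio.h>

    static volatile char buf[16384] __attribute__((aligned(128)));

    /* Fetch sizeof(buf) bytes from effective address ea in main memory
       into the local store, using DMA tag group 0, and wait for it. */
    void fetch(unsigned long long ea)
    {
        mfc_get(buf, ea, sizeof(buf), /*tag*/ 0, 0, 0);
        mfc_write_tag_mask(1 << 0);       /* select tag group 0        */
        mfc_read_tag_status_all();        /* block until DMA completes */
    }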
ClustalW: a program for alignment of nucleotide or amino acid sequences (example inputs: atgagttcttaa, gattgttgcc, gccttcttgtta, cgttaacttc). With N the number of sequences and L the typical sequence length, the three phases are:
- Pairwise alignment: space O(N² + L²), time O(N²L²)
- Guide tree: space O(N²), time O(N⁴)
- Progressive alignment: space O(N² + L²), time O(N⁴ + L²)
Analysis of ClustalW: determine what is important
(Chart: percentage of execution time spent in pairwise alignment (PW), guide tree (GT), and progressive alignment (PA), for N = 10, 50, 100, 500, 1000 sequences and sequence lengths 10-1000)
Overview: Single-SPU Optimizations
A Simple Port from PPU to SPU Is Not Enough
- Minimal code changes to allow execution on the SPU (a launch sketch follows below)
- Important slowdown: ×0.87 and ×0.73 relative to the PPU on the two phases
- The overhead of DMAs and mailboxes is not the cause
(Charts: execution time in seconds, PPU vs. SPU-base, for pairwise and progressive alignment)
Platform: QS21 dual Cell BE-based blade, each Cell BE at 3.2 GHz; gcc 4.1.1; Fedora Core 7
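For reference, a minimal sketch of what such a baseline port involves on the PPU side with libspe2. The program handle name, argument passing, and error handling are illustrative, not the talk's actual code.

    #include <libspe2.h>
    #include <stdio.h>

    extern spe_program_handle_t clustalw_spu;   /* embedded SPU ELF image (assumed name) */

    int run_on_spu(void *argp)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL) { perror("spe_context_create"); return -1; }

        spe_program_load(ctx, &clustalw_spu);
        /* blocks until the SPU program stops; argp is handed to the SPU main() */
        spe_context_run(ctx, &entry, 0, argp, NULL, NULL);
        spe_context_destroy(ctx);
        return 0;
    }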
Optimizing for the SPU: Loop Structure
- In both phases, the majority of the work is performed by three consecutive loop nests: a forward pass (increasing indices into the sequence arrays), a backward pass (decreasing indices), and a third loop that uses intermediate values of the forward and backward passes
- The forward loop is the most important one for pairwise alignment (a sketch follows below)
- The third loop performs the least work in both phases
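For readers unfamiliar with these loops, here is a minimal sketch of the kind of forward pass involved, assuming Gotoh-style affine gap scoring (open penalty g, extension penalty gh) and a local-alignment score clamped at 0. Array and variable names follow the HH/f convention used on the next slides; the real ClustalW code differs in detail.

    /* seq1/seq2 are 1-based encoded sequences (values < 32); HH and DD have m+1 entries. */
    static int forward_pass(const int *seq1, const int *seq2, int n, int m,
                            int g, int gh, int *HH, int *DD,
                            const int score[32][32], int *se1, int *se2)
    {
        int maxscore = 0;
        for (int j = 0; j <= m; j++) { HH[j] = 0; DD[j] = 0; }

        for (int i = 1; i <= n; i++) {
            int p = 0, h = 0, f = -g;
            for (int j = 1; j <= m; j++) {
                int t;
                f -= gh;  t = h - g - gh;        if (f < t) f = t;          /* gap in seq2 */
                DD[j] -= gh; t = HH[j] - g - gh; if (DD[j] < t) DD[j] = t;  /* gap in seq1 */

                h = p + score[seq1[i]][seq2[j]];                            /* substitution   */
                if (h < f)     h = f;
                if (h < DD[j]) h = DD[j];
                if (h < 0)     h = 0;                                       /* local alignment */

                if (h > maxscore) { maxscore = h; *se1 = i; *se2 = j; }     /* track maximum  */
                p = HH[j];  HH[j] = h;
            }
        }
        return maxscore;
    }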
Optimizing for the SPU: Control Flow Optimization
- Branch misprediction is expensive on the SPU (18 cycles)
- Convert control flow to data flow using compare & select when determining the maximum:

      if (x > max) max = x;                       /* branch            */
      max = spu_sel(max, x, spu_cmpgt(x, max));   /* compare & select  */

- Result: ×1.35 speedup on pairwise alignment (chart: execution time in seconds)
Optimizing for the SPU: j-Loop Vectorization
- Vectorize by 4 i-loop iterations: f, e, s and HH[j] each hold 4 × 32-bit values; vector length = 128 bits
(Figure: four i-loop iterations packed into one vector)
Optimizing for the SPU: Vectorization (continued)
- Difficulty 1: construction of loop pre- and post-ambles
- Difficulty 2: pairwise alignment computes the position of the maximum, but vectorization changes the execution order. Solution: for each vector lane, keep the position of its maximum; afterwards, select the maximum that occurred first in the original program order (sketched below)
- Result: ×2.67 speedup on pairwise alignment (chart: execution time in seconds)
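A sketch of how the per-lane maximum and its position can be tracked and then reduced after the loop. Names, types, and the tie-breaking rule shown here are illustrative; the exact rule depends on how scalar iterations are mapped to the lanes.

    #include <spu_intrinsics.h>
    #include <limits.h>

    static vector signed int vmax, vpos;

    void init_max(void) { vmax = spu_splats(INT_MIN); vpos = spu_splats(0); }

    /* h:  the 4 freshly computed scores of this vectorized step
       vi: the 4 corresponding scalar iteration indices            */
    void update_max(vector signed int h, vector signed int vi)
    {
        vector unsigned int gt = spu_cmpgt(h, vmax);   /* lanes with a new maximum */
        vmax = spu_sel(vmax, h,  gt);
        vpos = spu_sel(vpos, vi, gt);
    }

    /* After the loop: reduce the 4 lanes, preferring the position that
       occurs first in the original scalar program order on ties.       */
    int reduce_max(int *bestpos)
    {
        int best = spu_extract(vmax, 0), pos = spu_extract(vpos, 0);
        for (int lane = 1; lane < 4; lane++) {
            int m = spu_extract(vmax, lane), p = spu_extract(vpos, lane);
            if (m > best || (m == best && p < pos)) { best = m; pos = p; }
        }
        *bestpos = pos;
        return best;
    }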
Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment
- Loop structure (pseudocode) resulting from the vectorization:

      vector int h;
      for (j = 1; j < N; ++j) {
          h[0..3] = HH[j..j+3];    /* unaligned vector load  */
          /* operations on h */
          HH[j..j+3] = h[0..3];    /* unaligned vector store */
      }

- Memory accesses: each iteration reads and writes four consecutive elements of HH[], shifted by one element per iteration (figure: iterations 0-4 over elements 0-9), so the accesses are not 16-byte aligned
Unaligned Vector Loads on the Cell (7 instructions)

      vector int v, qw0, qw1, part0, part1;
      unsigned int shift;

      shift = (unsigned int)(&HH[j]) & 15;
      qw0 = *((vector int *)&HH[j]);
      qw1 = *(((vector int *)&HH[j]) + 1);
      part0 = spu_slqwbyte(qw0, shift);
      part1 = spu_rlmaskqwbyte(qw1, (signed int)(shift - 16));
      v = spu_or(part0, part1);

(Figure: the unaligned vector spans two adjacent aligned quadwords)
Unaligned Vector Stores on the Cell (10 instructions)

      vector int v, qw0, qw1, merge0, merge1;
      vector unsigned int mask;
      unsigned int shift;

      shift = (unsigned int)(&HH[j]) & 15;
      qw0 = *((vector int *)&HH[j]);
      qw1 = *(((vector int *)&HH[j]) + 1);
      mask = (vector unsigned int)spu_rlmaskqwbyte(
                 spu_promote((unsigned char)0xff, 0), -shift);
      v = spu_rlqwbyte(v, -shift);
      merge0 = spu_sel(qw0, v, mask);
      merge1 = spu_sel(v, qw1, mask);
      *(vector int *)(&HH[j-3]) = merge0;
      *((vector int *)(&HH[j-3]) + 1) = merge1;

(Figure: the unaligned store merges the new vector into two adjacent aligned quadwords)
Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment (continued)
- Memory accesses: the address of HH[1] is vector-aligned, and 4 consecutive iterations access 2 aligned vectors (figure: iterations 0-4 over elements 0-9)
- Optimizations: unroll the loop 4 times; ensure the first iteration accesses aligned elements; cache memory values in registers; remove redundant computations (sketched below)
- Result: one aligned vector load and store per cycle in the regime loop
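A sketch of what one unrolled-by-4 step of the regime loop can look like under these assumptions. The shuffle patterns and names are illustrative, and the merging of results back into aligned stores is only indicated in comments; it is not the talk's actual code.

    #include <spu_intrinsics.h>

    /* With &HH[j] 16-byte aligned, iterations j..j+3 together touch only the two
       aligned quadwords HH[j..j+3] and HH[j+4..j+7]: load them once and build the
       shifted views with register shuffles instead of unaligned loads.           */
    void regime_step(int *HH, int j)
    {
        const vector unsigned char SH4  = {  4, 5, 6, 7,  8, 9,10,11, 12,13,14,15, 16,17,18,19 };
        const vector unsigned char SH8  = {  8, 9,10,11, 12,13,14,15, 16,17,18,19, 20,21,22,23 };
        const vector unsigned char SH12 = { 12,13,14,15, 16,17,18,19, 20,21,22,23, 24,25,26,27 };

        vector signed int qw0 = *((vector signed int *)&HH[j]);      /* aligned load */
        vector signed int qw1 = *((vector signed int *)&HH[j] + 1);  /* aligned load */

        vector signed int h0 = qw0;                         /* HH[j  ..j+3], iteration j   */
        vector signed int h1 = spu_shuffle(qw0, qw1, SH4);  /* HH[j+1..j+4], iteration j+1 */
        vector signed int h2 = spu_shuffle(qw0, qw1, SH8);  /* HH[j+2..j+5], iteration j+2 */
        vector signed int h3 = spu_shuffle(qw0, qw1, SH12); /* HH[j+3..j+6], iteration j+3 */

        /* ... the four unrolled loop bodies update h0..h3 here ... */

        /* the updated lanes are merged back into two quadwords and written with
           two aligned stores before j advances by 4                             */
    }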
Optimizing for the SPU: Loop Unrolling to Avoid Unaligned Accesses
- Unaligned vector accesses cost 4 × 17 = 68 instructions per unrolled group if unoptimized; reduced to 24 instructions in the regime loop
- The optimization applies to multiple arrays
- The optimization is applied only to the regime loop
- Result: ×1.94 speedup on pairwise alignment (chart: execution time in seconds)
Optimizing for the SPU: Pairwise Alignment (summary)
(Charts: execution time in seconds and code size in KB, broken down into the forward, backward, and 3rd loops)
Optimizing for the SPU: Progressive Alignment
(Charts: execution time in seconds and code size in KB, broken down into the forward and backward loops; the chart is annotated with speedups of ×1.35, ×1.62, and ×2.07)
Overview: Parallelization
Parallelization of Pairwise Alignment
- The scores for every pair of sequences (i, j) can be calculated independently; each pair is one work package
- Maximize load balance by processing the packages in order of decreasing size (see the sketch below)
(Chart: speedup of pairwise alignment versus the number of SPUs, 1-16)
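A sketch of the scheduling idea; the structure and names are illustrative, and the actual dispatch mechanism of the talk is not shown. All pairs are built, sorted by decreasing estimated cost, and the next package is handed to whichever SPU becomes idle, e.g. through its inbound mailbox (spe_in_mbox_write in libspe2).

    #include <stdlib.h>

    typedef struct { int i, j; long cost; } pair_t;

    static int by_cost_desc(const void *a, const void *b)
    {
        long ca = ((const pair_t *)a)->cost, cb = ((const pair_t *)b)->cost;
        return (ca < cb) - (ca > cb);                  /* largest cost first */
    }

    /* work must hold nseqs*(nseqs-1)/2 entries; len[i] is the length of sequence i. */
    int build_schedule(pair_t *work, int nseqs, const int *len)
    {
        int n = 0;
        for (int i = 0; i < nseqs; i++)
            for (int j = i + 1; j < nseqs; j++)
                work[n++] = (pair_t){ i, j, (long)len[i] * len[j] };
        qsort(work, n, sizeof(pair_t), by_cost_desc);
        return n;   /* packages are then dispatched one by one to idle SPUs */
    }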
Parallelization of Progressive Alignment: Inter-Loop Parallelism
- Exploit parallelism between the loop nests (figure: work scheduling of the forward, backward, and 3rd loops)
- The 3rd loop performs less work
Parallelization of Progressive Alignment: Intra-Loop Parallelism (1)
- Single-threaded loop:

      for (i = 0; i < N; ++i) {
          s = prfscore(i, j);
          /* other computations with cross-iteration dependences */
          D(s);
      }

- Parallel-stage pipeline: the prfscore() calls of different iterations run in parallel, while the D(s) stage with the cross-iteration dependences runs sequentially (figure: prfscore instances overlapping in time, D(s) executed in order)
Parallelization of Progressive Alignment: Intra-Loop Parallelism (2)
- Parallel stage prfscore(): NThreads threads execute the code in this stage
- Sequential stage D(s): one thread executes the code in this stage

      /* parallel stage: executed by each thread T of NThreads */
      for (i = 0; i < N; ++i) {
          if (i % NThreads == T) {
              s = prfscore(i, j);
              put(s, Queue[T]);
          }
      }

      /* sequential stage: executed by a single thread */
      for (i = 0; i < N; ++i) {
          s = get(Queue[i % NThreads]);
          D(s);
      }

- Shown here for a round-robin distribution; in practice, take vectorization into account and distribute larger blocks to minimize branches
Parallelization of Progressive Alignment
- Exploits both inter-loop and intra-loop parallelism (figure: SPU 0 runs the forward loop, fed by the A-set prfscore SPUs 2 and 4; SPU 1 runs the backward and 3rd loops, fed by the B-set prfscore SPUs 3 and 5)
- Inter-SPU communication is optimized by using non-blocking SPU-to-SPU local store copies (sketched below)
(Chart: speedup of progressive alignment with 6 SPUs relative to 1 SPU)
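A sketch of what a non-blocking SPU-to-SPU local store copy looks like, assuming the PPU has distributed the effective addresses of the SPUs' local stores (e.g. obtained with spe_ls_area_get) to the SPU programs beforehand; the function name and synchronization policy are illustrative.

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* Push a local buffer into another SPU's local store without blocking.
       remote_ls_ea is the effective address of the target SPU's local store;
       remote_offset is the destination offset inside that local store.       */
    void push_to_spu(volatile void *local_buf, uint64_t remote_ls_ea,
                     uint32_t remote_offset, uint32_t size, uint32_t tag)
    {
        mfc_put(local_buf, remote_ls_ea + remote_offset, size, tag, 0, 0);
        /* No wait here: the copy overlaps with further computation.  Before
           local_buf is reused, synchronise on the tag:
               mfc_write_tag_mask(1 << tag);
               mfc_read_tag_status_all();                                    */
    }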
Overall Speedup
(Chart: speedup versus the number of SPUs, 1-16, for pairwise alignment, progressive alignment, calcprf1, and the total)
Overview: Analysis
Total Speedup Is Limited by the Least Parallel Part
(Chart: execution time in seconds versus the number of SPUs, 0-16, split into progressive alignment and pairwise alignment)
Comparison to a Homogeneous Multicore
- Dual Cell BE blade, strongly optimized, versus an AMD Opteron 2378 system (2 × 4 cores), parallelized but otherwise unoptimized
(Charts: execution time in seconds, split into pairwise and progressive alignment, versus the number of SPUs (Cell, 1-16) and the number of cores (AMD, 1-8))
- The time spent in progressive alignment is higher on the Cell BE, both in the baseline and after optimization
- 1-SPU optimizations are compensated by smart hardware: the strongly optimized 1-SPU implementation is as fast as single-threaded execution on a general-purpose fat core
- What if the GPP code were vectorized? The AMD/Cell ratio for pairwise alignment is 2.21; it is plausible that this gap can be bridged by vectorization
- Overall, the Cell BE loses the comparison: AMD is faster by 28%, due to the lack of performance of the Cell BE on progressive alignment
- A hybrid system is fastest: executing pairwise alignment on the Cell BE and progressive alignment on the GPP is about 40% faster than AMD
Looking Back
- Let's say we have done this.
- What performance can we obtain? Significant speedups.
- How much effort? Significant: several person-months.
- What happens to the code? Huge code transformations and duplications; we can't really maintain it any more.
- Will we regret this? In this case, the gains compared to homogeneous multicores are negative, due to a lack of parallelism in part of the program.
Conclusion
- Accelerators contain many simple cores
- Significant optimization is required for single-core performance; each accelerator architecture benefits from different optimizations, and there is no compiler to do the job for you (fat general-purpose cores do much of this work for you)
- Performance improvements are worthwhile, but optimization is time-consuming, error-prone, and architecture specific, and the optimized code is nearly intractable to maintain
- Beware of Amdahl's Law: scalable parallelism is required in all parts of the application
Acknowledgements
Collaborators: Sean Rul, Joris D'Haene, Michiel Questier, Koen De Bosschere