Optimizing Code for Accelerators: The Long Road to High Performance
1 Optimizing Code for Accelerators: The Long Road to High Performance. Hans Vandierendonck, Mons GPU Day, November 9th, 2010
2 The Age of Accelerators
3 Accelerators in Real Life
4 Why Accelerators? Exploit the available transistor budget: increased performance (30:1, 1,000:1) and energy-efficiency (30,000:1 in performance/watt), at the cost of sacrificing generality (application-specific). (Dally et al., The Classical Computer, ISAT study, 2001.) [Figure: latency (ps/inst) trend.]
5 GPP vs Accelerators

                            AMD Opteron 2380   NVIDIA Tesla c1060   ATI FirePro V8700   STI Cell QS21 (8 SPEs)
    Core frequency          2.7 GHz            1.3 GHz              750 MHz             3.2 GHz
    Processor cores         ...                ...                  ...                 ...
    Data-level parallelism  4-way SIMD         8-way SIMT           5-way VLIW          4-way SIMD
    Memory BW               24 GB/s            ... GB/s             ... GB/s            25.6 GB/s (in), 35 GB/s (out)
    Peak perf SP (GFLOPS)   ...                ...                  ...                 ...
    Peak perf DP (GFLOPS)   ...                ...                  ...                 ...
    TDP                     75 W               187 W                114 W               45 W
    Feature size            45 nm              55 nm                55 nm               90 nm
6 Massively Parallel Architectures, e.g. the NVIDIA GPU architecture. [Figure: GPU block diagram.]
7 Position of This Talk. Let's say we want to do this: what performance can we obtain? How do we optimize the code? How much effort does it take? What happens to the code? Will we regret it? Method: implement algorithms, study performance portability, and optimize an application for the Cell processor.
8 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
9 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
10 Method: Compare Code Optimizations Across Processors. Two optimizations: loop unrolling (replicate the loop body) and vectorization (replace scalar instructions, e.g. ai R1, R2, 1, by a single vector instruction operating on lanes A B C D). Processors: CPU (Core i7), Tesla c1060, FirePro V8700, Cell QS21. Evaluate the code optimizations on several accelerators; OpenCL provides functional portability.
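To make the two transformations concrete, here is a minimal sketch on a trivial array loop (not the benchmark code itself); the vector variant uses the GCC vector extension, and 16-byte alignment and divisibility of n by 4 are assumed:

    #include <stddef.h>

    /* Baseline scalar loop. */
    void scale_scalar(float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];
    }

    /* Unrolled by 4: fewer branch tests, more instruction-level
       parallelism (assumes n is divisible by 4). */
    void scale_unrolled(float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
            a[i + 2] = 2.0f * b[i + 2];
            a[i + 3] = 2.0f * b[i + 3];
        }
    }

    /* Vectorized: one 4-lane operation per iteration (GCC vector
       extension; assumes 16-byte-aligned arrays). */
    typedef float v4sf __attribute__((vector_size(16)));

    void scale_vector(float *a, const float *b, size_t n) {
        const v4sf two = { 2.0f, 2.0f, 2.0f, 2.0f };
        for (size_t i = 0; i < n; i += 4)
            *(v4sf *)&a[i] = two * *(const v4sf *)&b[i];
    }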
11 Performance Impact of Loop Unrolling and Vectorization. [Chart: relative execution time vs. unroll factor, scalar and vector variants, for CPU, Tesla, FirePro, and Cell. Benchmark: cp; block size 1 on Cell, 128 on the others.]
12 Performance Impact of Loop Unrolling and Vectorization. The CPU benefits from vectorization and is indifferent to loop unrolling. [Same chart as the previous slide.]
13 Performance Impact of Loop Unrolling and Vectorization. Vectorization is critical to Cell, and Cell benefits from loop unrolling. [Same chart.]
14 Performance Impact of Loop Unrolling and Vectorization. Optimizations interact with each other: the FirePro is more sensitive, and too much unrolling degrades performance. [Same chart.]
15 Performance Impact of Loop Unrolling and Vectorization. The same holds on a second benchmark: vectorization is critical to Cell, and Cell benefits from loop unrolling. [Charts for benchmarks cp and mri-fhd side by side.]
16 Performance Impact of Thread Block Size (Parallelism). Thread block size is the most important parameter for Tesla. [Chart: execution time (secs) vs. block size for CPU, Tesla, FirePro, and Cell. Benchmark: mri-fhd; loop unrolling: 2; vectorization: yes.]
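In OpenCL, which this study uses for functional portability, the thread-block size corresponds to the local work-group size passed at kernel launch. A hypothetical launch sketch (queue and kernel are assumed to be created elsewhere; in OpenCL 1.x the global size must be a multiple of the local size):

    #include <CL/cl.h>

    /* Hypothetical launch helper: the local size is the tuning knob
       studied on this slide (128 for CPU/Tesla/FirePro, 1 for Cell). */
    void launch(cl_command_queue queue, cl_kernel kernel, size_t n_items) {
        size_t global[1] = { n_items };  /* total work-items          */
        size_t local[1]  = { 128 };      /* the "thread block size"   */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local,
                               0, NULL, NULL);
    }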
17 What have we learned from this? Performance portability is no free lunch: optimizations are architecture-specific, sensitivity is program-specific, optimizations interact with each other, and the potential speedups are large.
18 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
19 Cell processor: a high-level overview. A heterogeneous multi-core:
- Power Processing Element (PPE): 64-bit PowerPC RISC; 32 KB IL1, 32 KB DL1, 512 KB L2; in-order instruction issue; 2-way superscalar
- Synergistic Processing Element (SPE): 128-bit in-order RISC-like vector processor; 256 KB Local Store with explicit memory management; SIMD; no hardware branch predictor
- System memory interface (MFC): 16 B/cycle, 25.6 GB/s (at 1.6 GHz)
- Element Interconnect Bus (EIB): 4 data rings of 16 bytes; ... GB/s peak BW
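Explicit memory management means SPE code must itself stage data between main memory and the Local Store by DMA. A minimal sketch using the standard spu_mfcio.h interface (the buffer size and tag number are arbitrary choices for illustration):

    #include <spu_mfcio.h>

    #define TAG 3  /* arbitrary DMA tag, 0..31 */

    /* Local Store buffer; DMA addresses and sizes must be suitably
       aligned (128-byte alignment gives best transfer performance). */
    static char buf[4096] __attribute__((aligned(128)));

    void fetch_block(unsigned long long ea /* main-memory address */) {
        /* Start the DMA transfer from main memory into the Local Store. */
        mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);
        /* Block until all transfers with this tag have completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }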
20 ClustalW: a program for alignment of nucleotide or amino acid sequences (N = number of sequences, L = typical sequence length). Three phases:
- Pairwise alignment: space O(N² + L²), time O(N²L²)
- Guide tree: space O(N²), time O(N⁴)
- Progressive alignment: space O(N² + L²), time O(N⁴ + L²)
[Figure: example sequences atgagttcttaa, gattgttgcc, gccttcttgtta, cgttaacttc.]
21 Analysis of ClustalW: determine what is important. [Chart: percentage of execution time spent in pairwise alignment (PW), guide tree (GT), and progressive alignment (PA) for N = 10, 50, 100, 500, 1000 sequences.]
22 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
23 A simple port from PPU to SPU: minimal code changes to allow execution on the SPU are not enough. There is an important slowdown, and the overhead of DMAs and mailboxes is not the cause. [Charts: execution time (secs) of pairwise alignment and progressive alignment, PPU vs. SPU-base; 0.73x on progressive alignment.] Setup: QS21 dual Cell BE-based blade, each Cell at 3.2 GHz, gcc, Fedora Core 7.
24 Optimizing for the SPU: Loop Structure. In both phases the majority of the work is performed by 3 consecutive loop nests:
- forward: increasing indices for the sequence arrays
- backward: decreasing indices for the sequence arrays
- 3rd: uses intermediate values of forward & backward
The forward loop is the most important for pairwise alignment; the 3rd loop performs the least work in both phases.
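For orientation, a schematic of such a forward loop nest (an illustrative stand-in, not ClustalW's actual code): a dynamic-programming pass that updates a row array HH[] in place, the same array the following slides vectorize; score() and the gap penalty g are placeholders:

    /* Schematic forward pass of a dynamic-programming alignment:
       HH[0..N] holds one row of the score matrix, pre-initialized with
       the boundary row, and is updated in place as j advances. */
    extern int score(int i, int j);  /* illustrative substitution score */

    void forward_pass(int *HH, int M, int N, int g) {
        for (int i = 1; i <= M; ++i) {
            int diag = HH[0];            /* HH[j-1] as it was in row i-1 */
            for (int j = 1; j <= N; ++j) {
                int best = diag + score(i, j);               /* match    */
                if (HH[j]   - g > best) best = HH[j]   - g;  /* gap in x */
                if (HH[j-1] - g > best) best = HH[j-1] - g;  /* gap in y */
                diag  = HH[j];           /* save old value for next j    */
                HH[j] = best;
            }
        }
    }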
25 Optimizing for the SPU: Control Flow Optimization. Branch misprediction is expensive (18 cycles), so convert control flow to data flow using compare & select. Determining a maximum, if (x > max) max = x;, becomes max = spu_sel(max, x, spu_cmpgt(x, max)); [Chart: pairwise alignment, speedup 1.35x.]
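A self-contained sketch of the same compare & select idea applied to a whole loop (assuming the spu_intrinsics.h types; illustrative, not the ClustalW kernel):

    #include <spu_intrinsics.h>

    /* Branch-free maximum over n4 quadwords of ints. spu_cmpgt yields
       an all-ones mask per lane where the first operand is greater;
       spu_sel picks lanes from its second operand where the mask is set,
       so the loop body contains no branch. */
    int max_branchless(const vec_int4 *v, int n4) {
        vec_int4 m = v[0];
        for (int i = 1; i < n4; ++i)
            m = spu_sel(m, v[i], spu_cmpgt(v[i], m));
        /* Final reduction of the four lanes. */
        int best = spu_extract(m, 0);
        for (int lane = 1; lane < 4; ++lane) {
            int x = spu_extract(m, lane);
            if (x > best) best = x;
        }
        return best;
    }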
26 Optimizing for the SPU: j-loop Vectorization. Vectorize across 4 i-loop iterations: the variables f, e, and s become 4 x 32-bit vectors (vector length = 128 bits). [Figure: HH[j] accesses across the vectorized i-loop iterations.]
27 Optimizing for the SPU: Vectorization (cont.). Difficulty 1: construction of loop pre- and post-ambles. Difficulty 2: pairwise alignment computes the position of the maximum, and vectorization changes the execution order. Solution: for each of the vector lanes keep the position of its maximum; afterwards select the maximum that occurred first in original program order. [Chart: pairwise alignment speedup.]
28 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment. The code structure results from vectorization:

    vector int h;
    for (j = 1; j < N; ++j) {
        h[0..3] = HH[j..j+3];
        // operations on h
        HH[j..j+3] = h[0..3];
    }

[Figure: the memory accesses to HH[j] shift by one element per iteration (iterations 0 through 4), so most vector accesses are unaligned.]
29 Unaligned Vector Loads on the Cell. An unaligned vector load takes 7 instructions:

    vector int v, qw0, qw1, part0, part1;
    unsigned int shift;

    shift = (unsigned int)(&HH[j]) & 15;
    qw0   = *((vector int *)&HH[j]);
    qw1   = *(((vector int *)&HH[j]) + 1);
    part0 = spu_slqwbyte(qw0, shift);
    part1 = spu_rlmaskqwbyte(qw1, (signed int)(shift - 16));
    v     = spu_or(part0, part1);

[Figure: the unaligned vector spans two adjacent aligned quadwords.]
30 Unaligned Vector Stores on the Cell. An unaligned vector store takes 10 instructions:

    vector int v, qw0, qw1, merge0, merge1;
    vector unsigned int mask;
    unsigned int shift;

    shift  = (unsigned int)(&HH[j]) & 15;
    qw0    = *((vector int *)&HH[j]);
    qw1    = *(((vector int *)&HH[j]) + 1);
    mask   = (vector unsigned int)spu_rlmaskqwbyte(
                 spu_promote((unsigned char)0xff, 0), -shift);
    v      = spu_rlqwbyte(v, -shift);
    merge0 = spu_sel(qw0, v, mask);
    merge1 = spu_sel(v, qw1, mask);
    *((vector int *)(&HH[j-3]))       = merge0;
    *(((vector int *)(&HH[j-3])) + 1) = merge1;

[Figure: read-modify-write of the two aligned quadwords surrounding the store.]
31 Unaligned Vector Loads and Stores in Hot Loops of Pairwise Alignment: Optimizations. The address of HH[1] is vector-aligned, and each iteration accesses 2 aligned vectors. Optimizations:
- Unroll the loop 4 times
- Ensure the first iteration accesses aligned elements
- Cache memory values in registers
- Remove redundant computations
Result: one aligned vector load and store per cycle in the regime.
[Figure: memory accesses to HH[j] in iterations 0 through 4.]
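A schematic of the resulting regime loop (illustrative only): after 4x unrolling from an aligned starting element, each unrolled iteration touches exactly one aligned quadword of HH, which can be kept in a register; process() stands in for the real loop body:

    #include <spu_intrinsics.h>

    /* Illustrative regime loop after 4x unrolling: HH is assumed
       16-byte aligned and the loop starts at an aligned element, so
       every HH access is one aligned quadword. */
    extern vec_int4 process(vec_int4 h, int j);  /* stand-in for the body */

    void regime_loop(int *HH, int N) {
        vec_int4 *vHH = (vec_int4 *)HH;    /* aligned vector view of HH */
        for (int j = 0; j + 4 <= N; j += 4) {
            vec_int4 h = vHH[j / 4];       /* one aligned load          */
            h = process(h, j);             /* value stays in a register */
            vHH[j / 4] = h;                /* one aligned store         */
        }
        /* The remaining N % 4 elements go to a scalar post-amble (omitted). */
    }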
32 Optimizing for the SPU: Loop Unrolling to Avoid Unaligned DMAs. An unaligned vector access takes 4 x 17 = 68 instructions if unoptimized; this is reduced to 24 instructions in the regime. The optimization applies to multiple arrays, and is applied only to the regime loop. [Chart: pairwise alignment, execution time (secs).]
33 Optimizing for the SPU: Pairwise Alignment. [Charts: execution time (secs) and code size (KB) per optimization step, for the forward, backward, and 3rd loops.]
34 Optimizing for the SPU: Progressive Alignment. [Charts: execution time (secs) and code size (KB) per optimization step, for the forward and backward loops; speedups of 1.35x and 1.62x.]
35 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
36 Parallelization of Pairwise Alignment. The scores for every pair of sequences (i, j) can be calculated independently; each pair forms one work package. Maximize load balance by processing packages in order of decreasing size (see the sketch below). [Chart: pairwise alignment speedup vs. number of SPUs.]
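A minimal sketch of this decreasing-size scheduling policy (illustrative names; a package's cost is roughly the product of the two sequence lengths): sort packages by decreasing cost, then greedily assign each to the least-loaded worker:

    #include <stdlib.h>

    typedef struct { int i, j; long cost; } package_t;  /* cost ~ Li * Lj */

    static int by_decreasing_cost(const void *a, const void *b) {
        long ca = ((const package_t *)a)->cost;
        long cb = ((const package_t *)b)->cost;
        return (cb > ca) - (cb < ca);    /* descending order */
    }

    /* Greedy longest-first assignment to n_workers workers; 'load'
       (length n_workers, zero-initialized by the caller) tracks each
       worker's accumulated cost, 'owner[p]' receives the assignment. */
    void schedule(package_t *pkgs, int n_pkgs,
                  long *load, int *owner, int n_workers) {
        qsort(pkgs, n_pkgs, sizeof *pkgs, by_decreasing_cost);
        for (int p = 0; p < n_pkgs; ++p) {
            int w = 0;                   /* pick the least-loaded worker */
            for (int k = 1; k < n_workers; ++k)
                if (load[k] < load[w]) w = k;
            owner[p] = w;
            load[w] += pkgs[p].cost;
        }
    }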
37 Parallelization of Progressive Alignment: Inter-loop Parallelism. Exploit parallelism between loop nests: overlap the forward, backward, and 3rd loops of successive alignments. The 3rd loop performs less work. [Figure: work scheduling of the forward, backward, and 3rd loops.]
38 Parallelization of Progressive Alignment: Intra-loop Parallelism (1). Single-threaded:

    for (i = 0; i < N; ++i) {
        s = prfscore(i, j);
        // other computations with
        // cross-iteration dependences
        D(s);
    }

Restructure as a parallel-stage pipeline: the independent prfscore() calls execute in parallel, while the dependent D(s) calls execute in program order. [Figure: pipeline of prfscore() stages feeding D(s) over time.]
39 Parallelization of Progressive Alignment: Intra-loop Parallelism (2). Parallel stage: NThreads threads execute prfscore(). Sequential stage: 1 thread executes D(s).

    // For thread T of NThreads
    for (i = 0; i < N; ++i) {
        if (i % NThreads == T) {
            s = prfscore(i, j);
            put(s, Queue[T]);
        }
    }

    // Sequential stage (1 thread)
    for (i = 0; i < N; ++i) {
        s = get(Queue[i % NThreads]);
        D(s);
    }

Demonstrated here for a round-robin distribution; in practice, take vectorization into account and distribute larger blocks to minimize branches (see the sketch below).
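The blocked distribution hinted at on the slide might look as follows, assuming a block size B chosen as a multiple of the vector width (put(), Queue[] and prfscore() are the slide's illustrative primitives, declared here to keep the sketch self-contained):

    /* Block-cyclic parallel stage: thread T processes whole blocks of B
       consecutive iterations, so the vectorized prfscore() sees
       contiguous data and the ownership test runs once per block
       instead of once per iteration. */
    #define B 16                         /* multiple of the 4-lane vector width */

    typedef struct queue queue_t;        /* opaque queue; illustrative */
    extern queue_t *Queue[];
    extern void put(int value, queue_t *q);
    extern int  prfscore(int i, int j);

    void parallel_stage(int T, int NThreads, int N, int j) {
        for (int base = 0; base < N; base += B) {
            if ((base / B) % NThreads != T)
                continue;                /* block owned by another thread */
            int end = (base + B < N) ? base + B : N;
            for (int i = base; i < end; ++i)
                put(prfscore(i, j), Queue[T]);
        }
    }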
40 Parallelization of Progressive Alignment. Exploit both inter-loop and intra-loop parallelism; inter-SPU communication is optimized by using non-blocking SPU-to-SPU local store copies. [Figure: SPU mapping, e.g. SPU 0: fwd; SPU 1: bwd and 3rd; SPUs 2 and 4: A-set prfscore; SPUs 3 and 5: B-set prfscore. Chart: progressive alignment speedup vs. number of SPUs.]
41 Overall Speedup. [Chart: speedup vs. number of SPUs for pairwise alignment, progressive alignment, calcprf1, and the total.]
42 Overview: Introduction, Performance Portability, ClustalW and Cell BE, Single-SPU Optimizations, Parallelization, Analysis
43 Total Speedup is Limited by the Least Parallel Part. [Chart: execution time (seconds) of progressive alignment and pairwise alignment vs. number of SPUs.]
44 Comparison to a Homogeneous Multicore. Dual Cell BE blade, strongly optimized, vs. AMD Opteron with 4 cores, parallelized but without optimizations. [Charts: execution time (seconds) of progressive and pairwise alignment vs. number of SPUs / number of cores.]
45 Comparison to a Homogeneous Multicore. The time spent in progressive alignment is higher on the Cell BE, both in the baseline and after optimization. [Charts as on the previous slide.]
46 1-SPU Optimizations Are Compensated by Smart Hardware. The strongly optimized 1-SPU implementation is about as fast as the single-threaded version on a general-purpose fat core. [Charts as before.]
47 What if GPP Code is Vectorized? Given the AMD/Cell ratio on pairwise alignment, it is plausible that this gap can be bridged by vectorizing the GPP code. [Charts as before.]
48 Overall, Cell BE Loses the Comparison. AMD is faster by 28%, due to the lack of performance of the Cell BE on progressive alignment. [Charts as before.]
49 Hybrid System Is Fastest. Executing pairwise alignment on the Cell BE and progressive alignment on the GPP is about 40% faster than AMD. [Charts as before.]
50 Looking back. Let's say we have done this. What performance can we obtain? Significant speedups. How much effort? Significant: several person-months. What happens to the code? Huge code transformations and duplications; we can't really maintain it any more. Will we regret this? In this case, the gains compared to homogeneous multicores are negative, due to the lack of parallelism in part of the program.
51 Conclusion. Accelerators contain many simple cores: significant optimization is required for single-core performance, each accelerator architecture benefits from different optimizations, and there is no compiler to do the job for you; fat general-purpose cores do much of this work for you. Performance improvements are worthwhile, but optimization is time-consuming, error-prone, and architecture-specific, and optimized code is nearly intractable to maintain. Beware of Amdahl's Law: scalable parallelism is required in all parts of the application.
52 Acknowledgements. Collaborators: Sean Rul, Joris D'Haene, Michiel Questier, Koen De Bosschere.