PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR SOFTWARE TELESCOPE (5/5)

Transcription

1 PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR SOFTWARE TELESCOPE (5/5) Rob van Nieuwpoort Vrije Universiteit Amsterdam & Astron, the Netherlands Institute for Radio Astronomy

2 Why Radio? Credit: NASA/IPAC

3 Centaurus A, visible light and radio

4 The Dwingeloo telescope Dwingeloo telescope, 's 25m dish, largest turnable telescope in the world Hydrogen line (21cm), galaxies Dwingeloo I & II Now a national monument

5 Westerbork synthesis radio telescope 14 25m dishes, 3 km Combined in hardware Built in 1970, upgraded in MHz GHz

6 Software radio telescopes (1) 6 Software radio telescopes We cannot keep on building larger dishes Replace dishes with thousands of small antennas Combine signals in software

7 Software radio telescopes (2) Software telescopes are being built now LOFAR: LOw Frequency Array (Netherlands, Europe) ASKAP: Australian Square Kilometre Array Pathfinder MeerKAT: Karoo Array Telescope (South Africa) 2020: SKA, Square Kilometre Array Exa-scale! (10 18 :giga, tera, peta, exa)

8 LOFAR 8 Largest telescope in the world omni-directional antennas Hundreds of gbit/s 14x LHC Hundreds of teraflops MHz 100x more sensitive

9 LOFAR overview Hierarchical Receiver Tile Station Telescope Central processing Groningen IBM BG/P Dedicated fibers

10 LOFAR low-band antennas 10

11 LOFAR high-band antennas 11

12 Station (150m)

13

14 2x3 km

15 Station cabinet

16 Station processing Special-purpose hardware, FPGAs 200 MHz ADC, filtering Send to BG/P Dedicated fiber UDP

17 LOFAR science 17 Imaging Epoch of re-ionization Cosmic rays Extragalactic surveys Transients Pulsars

18 A LOFAR observation Cas A Supernova remnant MHz 12 stations

19

20 Processing pipeline real time offline astronomy pipelines 10 terabit/s 265 DVDs /s 200 gigabit/s 5 DVDs /s 50 gigabit/s 1.3 DVD/s Data volume

21 Processing pipeline real time offline astronomy pipelines 10 terabit/s 265 DVDs /s 200 gigabit/s 5 DVDs /s 50 gigabit/s 1.3 DVD/s Data volume Flexibility

22 Processing pipeline real time offline astronomy pipelines 10 terabit/s 265 DVDs /s 200 gigabit/s 5 DVDs /s 50 gigabit/s 1.3 DVD/s Data volume Flexibility Data intensiveness

23 Processing overview 23

24 Online pipelines

25 Stella, the IBM Blue Gene/P 25 Was #2, now # MHz PowerPC Designed for energy efficiency Complex numbers 3-D torus, collective, barrier, 10 GbE, JTAG networks 2½ racks = 10,880 cores = 37 TFLOP/s + 160*10 Gb/s

26 Optimizations We need high bandwidth, high performance, real-time behavior Use assembly for performance-critical code [SPAA'06] Avoid resource contention by smart scheduling [PPoPP'10] Run part of application on I/O node [PPoPP'08] Use optimized network protocol [PDPTA'09] Modify OS to avoid software TLB miss handler [IJHPC'10] Use real-time scheduler [PPoPP'10] Drop data if running behind [PPoPP'10] Use asynchronous I/O [PPoPP'10]

27 BG/P performance 27 Correlator is O(n 2 ) achieve 96% of the theoretical peak

28 frequency Correlator output 28 time

29 Problem: processing is challenging Special-purpose hardware Inflexible Expensive to design Long time from design to production Supercomputer Flexible Expensive to purchase Expensive maintenance Expensive due to electrical power costs For SKA, we need orders of magnitude more!

30 Many-core advantages 30 Fast and cheap Latest ATI HD 6990 has 3072 cores, 5.1 tflops Costs only 575 euro! Comparison: entire 72-node DAS-4 VU cluster has 4.4 tflops Potentially more power efficient Example: In theory, ATI 4870 GPU is 15 times more power efficient than BG/P Many-cores are becoming more general CPUs are incorporating many-core techniques

31 Research questions Architectural problems: Which part of the theoretical performance can be achieved in practice? Can we get the data into the accelerators fast enough? Performance consistent enough for real-time use? Which architectural properties are essential?

32 Many-cores Intel core i7 quad core + hyperthreading + SSE Sony/Toshiba/IBM Cell/B.E. QS21 blade GPUs: NVIDIA Tesla C1060/GTX280, ATI 4870 Compare with production code on BG/P Compare architectures: Implemented everything in assembly Reader: Rob V. van Nieuwpoort and John W. Romein, Correlating Radio Astronomy Signals with Many-Core Hardware,

33 Essential many-core properties architecture Intel core i7 IBM BG/P ATI 4870 NVIDIA C1060 STI Cell cores x FPUs per core = total FPUs 4 x 4 = 16 4 x 2 = x 5 = x 8 = x 4 = 32 gflops registers/core x width (floats) device RAM bandwidth (GB/s) host RAM bandwidth (GB/s) per operation bandwidth slowdown compared to BG/P 16 x x x x x n.a n.a n.a (host: 150) (host: 117)

34 Correlator algorithm 34 For all channels (63488) For all combinations of two stations (2080) For the combinations of polarizations (4) Complex float sum = 0; For the time integration interval (768 samples) Sum += sample1 * sample2 (complex multiplication) Store sum in memory

35 Correlator optimization 35 Overlap data transfers and computations Exploit caches / shared memory / local store Loop unrolling Tiling Scheduling SIMD operations...

36 Correlator: Arithmetic Intensity 36 Correlator inner loop: for (time = 0; time < integrationtime; time++) { sum += samples[ch][station1][time][pol1] * samples[ch][station2][time][pol2]; } complex multiply-add: 8 flops sample: real + complex float (2 * 4 bytes)

37 Correlator: Arithmetic Intensity 37 Correlator inner loop: for (time = 0; time < integrationtime; time++) { sum += samples[ch][station1][time][pol1] * samples[ch][station2][time][pol2]; } complex multiply-add: 8 flops sample: real + complex float: 8 bytes AI: 8 FLOPS, 2 samples: 8 / 16 = 0.5

38 Correlator AI optimization 38 Combine polarizations complex multiply-add: 8 flops 2 polarizations: X, Y calculate XX, XY, YX, YY 32 flops per square XY-sample: 16 bytes (x2) 1 flop/byte Tiling 1 flop/byte 2.4 flops/byte but, we need registers 1x1 already needs 16!

39 Tuning the tile size tile size floating point operations memory loads (bytes) arithmetic intensity 1 x x x x x x x Minimum # Registers (floats)

40 Correlator implementation Intel Core i7 CPU The Cell Broadband Engine ATI & NVIDIA GPUs

41 Implementation strategy on CPU 41 Partition frequencies over the cores Independent Multithreading Each core computes its own correlation triangle Use tiling: 2x2 Vectorize with SSE Unroll time loop; compute 4 time steps in parallel

42 Implementation strategy on the Cell/BE 42 Partition frequencies over the SPEs Independent Each SPE computes its own correlation triangle Use tiling: 4x4 (128 registers!) Keep a strip of tiles in the local store: more reuse Use double buffering from memory to local store Overlap communication and computation Vectorize Different vector elements compute different polarizations

43 Implementation strategy on GPUs 43 Partition frequencies over the streaming multiprocessors Independent Double buffering between GPU and host Exploit data reuse as much as possible Each streaming multiprocessor computes a correlation triangle Threads/cores within a SM cooperate on a single triangle Load samples into shared memory Use tiling (4x3 on ATI, 3x2 on NVIDIA)

44 44 Evaluation

45 How to cheat with speedups, part 2 45 How can this be? Core I7 CPU has 154 GFLOPs NVIDIA GTX 580 GPU has 1581 GFLOPs (10.3 X more)

46 How to cheat with speedups, part 2 46 Heavily Optimize GPU version Coalescing, Shared memory Tiling, Loop unrolling Do not optimize CPU version 1 core only No SSE Cache unaware No loop unrolling and tiling, Result: very high speedups! Exception: kernels that do interpolations (texturing hardware) Solution Optimize CPU version Use efficiencies: % of peak performance, Roofline

47 Theoretical performance bounds 47 Distinguish between global and local (host vs device) Local AI = Depends on tile size, and # registers Max performance = AI * memory bandwidth ATI (4x3): 3.43 * = 395 gflops Peak of 1200 needs AI of 10.4 or 350 GB/s bandwidth NVIDIA (3x2): 2.40 * = 245 gflops Peak of 996 needs AI of 9.8 or 415 GB/s bandwidth Can we achieve more than this?

48 Theoretical performance bounds 48 Global AI = #stations + 1 (LOFAR: 65) Max performance = AI * memory bandwidth Max performance GPUs, with AI global: ATI: 65 * 4.6 = 300 gflops need 19 GB/s for peak NVIDIA: 65 * 5.6 = 363 gflops need 15 GB/s for peak

49 Correlator performance 49

50 Measured power efficiency 50 Current CPUs (even at 45 nm) still are less power efficient than BG/P (90 nm) GPUs are not 15, but only 2-3x more power efficient than BG/P 65 nm Cell is 4x more power efficient than the BG/P

51 gflops Scalability on NVIDIA GTX number of stations

52 Weak and strong points 52 Intel Core i7 IBM BG/P ATI 4870 NVIDIA Tesla C well-known toolchain - few registers - limited shuffling + L2 prefetch unit + high memory bandwidth - double precision only - expensive + largest # cores + shuffling support - low PCI-e bandwidth (4.6 GB/s) - transfer slows down kernel - CAL is low-level - bad Brook+ performance - not well documented STI Cell + Cuda is high-level + explicit cache (LS) + shuffle capabilities + power efficiency - low PCI-e bandwidth (5.6 GB/s) - multiple parallelism Levels (6!) - no increment in odd pipeline

53 Conclusions Software telescopes are the future, extremely challenging Software provides the required flexibility Many-core architectures show great potential (28x) PCI-e is a bottleneck Compared to the BG/P or CPUs, the many-cores have low memory bandwidth per operation This is OK if the architecture allows efficient data reuse Optimal use of registers (tile size + SIMD strategy) Exploit caches / local memories / shared memories The Cell has 8 times lower memory bandwidth per operation, but still works thanks to explicit cache control and large number of registers

54 Backup slides

55 Vectorizing the correlator 55 How do we efficiently use the vectors? for (pol1 = 0; pol1 < nrpolarizations; pol1++) { for (pol2 = 0; pol2 < nrpolarizations; pol2++) { float sum = 0.0; for (time = 0; time < integrationtime; time++) { sum += samples[ch][station1][time][pol1] * samples[ch][station2][time][pol2]; } } }

56 Vectorizing the correlator 56 Option 1: vectorize over time Unroll time loop 4 times for (pol1 = 0; pol1 < nrpolarizations; pol1++) { for (pol2 = 0; pol2 < nrpolarizations; pol2++) { float sum = 0.0; for (time = 0; time < integrationtime; time += 4) { sum += samples[ch][station1][time+0][pol1] * samples[ch][station2][time+0][pol2]; sum += samples[ch][station1][time+1][pol1] * samples[ch][station2][time+1][pol2]; sum += samples[ch][station1][time+2][pol1] * samples[ch][station2][time+2][pol2]; sum += samples[ch][station1][time+3][pol1] * samples[ch][station2][time+3][pol2]; } } }

57 Vectorizing the correlator 57 for (pol1 = 0; pol1 < nrpolarizations; pol1++) { for (pol2 = 0; pol2 < nrpolarizations; pol2++) { vector float sum = {0.0, 0.0, 0.0, 0.0}; for (time = 0; time < integrationtime; time += 4) { vector float s1 = { samples[ch][station1][time+0][pol1], samples[ch][station1][time+1][pol1], samples[ch][station1][time+2][pol1], samples[ch][station1][time+3][pol1], }; vector float s2 = { samples[ch][station2][time+0][pol2], samples[ch][station2][time+1][pol2], samples[ch][station2][time+2][pol2], samples[ch][station2][time+3][pol2], }; sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2 } result = sum.x + sum.y + sum.z + sum.w; } }

58 Vectorizing the correlator 58 Option 2: vectorize over polarization for (pol1 = 0; pol1 < nrpolarizations; pol1++) { for (pol2 = 0; pol2 < nrpolarizations; pol2++) { float sum = 0.0; for (time = 0; time < integrationtime; time++) { sum += samples[ch][station1][time][pol1] * samples[ch][station2][time][pol2]; } } }

59 Vectorizing the correlator 59 Option 2: vectorize over polarization Remove polarization loops (4 combinations) float sum = 0.0; for (time = 0; time < integrationtime; time++) { sum += samples[ch][station1][time][0] * samples[ch][station2][time][0]; // XX sum += samples[ch][station1][time][0] * samples[ch][station2][time][1]; // XY sum += samples[ch][station1][time][1] * samples[ch][station2][time][0]; // YX sum += samples[ch][station1][time][1] * samples[ch][station2][time][1]; // YY }

60 Vectorizing the correlator 60 vector float sum = {0.0, 0.0, 0.0, 0.0}; for (time = 0; time < integrationtime; time++) { vector float s1 = { samples[ch][station1][time][0], samples[ch][station1][time][0], samples[ch][station1][time][1], samples[ch][station1][time][1], }; vector float s2 = { samples[ch][station2][time][0], samples[ch][station2][time][1], samples[ch][station2][time][0], samples[ch][station2][time][1], }; sum = spu_madd(s1, s2, sum); // sum = sum + s1 * s2 // sum now contains {XX, XY, YX, YY} }

61 Delay Compensation

62 It's all about the memory feature Cell/B.E. GPUs access times uniform non-uniform cache sharing level single thread (SPE) all threads in a multiprocessor access to off-chip memory not possible, only through DMA memory access overlapping asynchronous DMA supported Hardware-managed thread preemption (tens of thousands of threads) communication communication between SPEs through EIB independent thread blocks + shared memory within a block 62