Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos
Forecast Efficient mapping of wavefront algorithms on the Cell Broadband Engine Double buffering and data streaming across the cores Unique data layout optimizations within the cores Developing an accurate performance prediction model End Result Achieving near-linear scalability or near-constant efficiency with respect to the number of cores on the chip 2
Outline Introduction Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the Accelerator Cores Evaluation / Results Conclusion 3
Outline Introduction The Cell Broadband Engine (B.E.) The Cell B.E. Architecture The QS20 Cell Blade The Wavefront Pattern Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the Accelerator Cores Evaluation / Results Conclusion 4
The Cell Broadband Engine Highlights 9 cores, 10 threads 3.2 GHz frequency > 200 GFlops (SP) Up to 25 GB/s memory B/W > 300 GB/s EIB Source: IBM Corporation 5
The Cell B.E. Architecture
[Block diagram: eight SPEs, each containing an SPU (SXU), a local store (LS) and a Memory Flow Controller (MFC), connected by the Element Interconnect Bus (EIB, up to 96B/cycle; 16B/cycle per port); the PPE (PPU with PXU, L1 and L2 caches; 64-bit Power Architecture with VMX), the MIC (Dual XDR memory) and the BIC (FlexIO) also attach to the EIB]
Source: IBM Corporation 6
The QS20 Cell Blade
[Photo: blade with two Cell processors, 1GB XDR memory, I/O controllers and the IBM Blade Center interface]
Source: IBM Corporation 7
Outline Introduction The Cell Broadband Engine (B.E.) The Cell B.E. Architecture The QS20 Cell Blade The Wavefront Pattern Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the Accelerator Cores Evaluation / Results Conclusion 8
The Wavefront Pattern
[Diagram: each matrix element depends on its North-West (NW), North (N) and West (W) neighbors]
Areas of utility:
Computational Biology: Smith-Waterman
Linear algebra: LU Decomposition
Multimedia: Video Encoding
Computational Physics: Particle Physics Simulations 9
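The NW/N/W dependency above can be sketched as a loop over anti-diagonals, where every cell on one diagonal is independent of the others (a minimal sketch; the function name `wavefront_fill` and its placeholder update rule are illustrative, not the paper's kernel):

```c
/* Fill an n x n matrix in wavefront order: cell (i, j) depends on its
 * North (i-1, j), West (i, j-1) and North-West (i-1, j-1) neighbors,
 * so all cells on one anti-diagonal can be computed in parallel. */
void wavefront_fill(int n, int m[n][n])
{
    for (int d = 0; d < 2 * n - 1; d++) {          /* anti-diagonals */
        int i_lo = d < n ? 0 : d - n + 1;
        int i_hi = d < n ? d : n - 1;
        for (int i = i_lo; i <= i_hi; i++) {       /* independent cells */
            int j = d - i;
            int nw = (i > 0 && j > 0) ? m[i-1][j-1] : 0;
            int no = (i > 0) ? m[i-1][j] : 0;
            int we = (j > 0) ? m[i][j-1] : 0;
            m[i][j] = nw + no + we + 1;            /* placeholder update */
        }
    }
}
```

The inner loop is where parallelism lives: on the Cell, those independent cells (or tiles of them) are what get distributed across the SPEs.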
Outline Introduction Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Tiled-Wavefront Model for Performance Prediction Optimizations for the SPEs Evaluation / Results Conclusion 10
Mapping to the Cell B.E. SPEs: Naïve mapping
Each element is processed on an individual SPE (S_i)
Each diagonal is computed in parallel
Bus overhead due to concurrent DMA calls (reads and writes): scalability issue
[Diagram: matrix elements assigned to SPEs S1-S6 by anti-diagonal; all SPEs access the matrix store in main memory over the Element Interconnect Bus] 11
Mapping to the Cell B.E.: Tiled-Wavefront
Elements are grouped to form square tiles: larger granularity; the tile dimension can be modified
Each tile is processed on an individual SPE (S_i)
Each tile-diagonal is computed in parallel
[Diagram: matrix partitioned into tiles, tile-rows and tile-columns; tiles on a tile-diagonal assigned to S1-S3] 12
Tile-Scheduling
Cyclic assignment of the SPEs to tile-rows
Block-row: group of active tile-rows
Computation overlap between consecutive block-rows (t9-t13)
[Diagram: staggered schedule with six SPEs; S1 computes tile-row 1 at time steps t1-t8, S2 computes tile-row 2 at t2-t9, and so on through S6; S1 and S2 then wrap around to the next block-row, overlapping the first] 13
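Assuming 1-based time steps, the cyclic, staggered schedule above reduces to two tiny helpers (hypothetical names; a sketch of the slide's assignment, not the actual SPE scheduler):

```c
/* Cyclic tile scheduling sketch: with S active SPEs, tile-row r (0-based)
 * is handled by SPE (r % S), and tile (r, c) is computed at time step
 * r + c + 1, matching the stagger on the slide (row 0: t1..t8,
 * row 1: t2..t9, and so on). */
int spe_for_row(int r, int S) { return r % S; }
int time_step(int r, int c)   { return r + c + 1; }
```

The one-step stagger per row is exactly the wavefront dependency at tile granularity: a tile cannot start before its North tile (previous row, same column) has finished.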
Tile-Scheduling (continued)
S = number of active SPEs
S iterations pass before all SPEs are fully utilized
Block-row = S tile-rows
Computation-Communication Pattern
[Diagram: for each tile, the boundary column from the West tile arrives via a local buffer copy and the boundary row from the North tile via DMA; after computation, the tile is DMAed to main memory and a ready message is sent so the East and South tiles can proceed] 15
Model for Performance Prediction
T_matrix_filling = T_one_tile_diagonal * (number of tile-diagonals) + T_serial_code
                 = (T_tile + T_DMA) * [(m * n) + S] + T_serial_code
Time for processing a tile-diagonal in a block-row (or a single tile) = (T_tile + T_DMA); independent of the number of tiles in the tile-diagonal
Number of tile-diagonals = (m * n) + S
Model Usage:
Sampling phase: measure T_tile, T_DMA and T_serial_code
Calculate m, n and S from the input problem size, tile dimension and number of available SPEs
[Diagram: tiled matrix with m block-rows and n tile-diagonals per block-row, plus S tile-diagonals of pipeline fill; computation overlap shown between block-rows] 16
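The model is one line of arithmetic once the sampled terms are known; a hedged sketch (the function name and double-precision types are mine, the formula is the slide's):

```c
/* Predicted matrix-filling time:
 *   T = (T_tile + T_DMA) * (m*n + S) + T_serial,
 * where m = block-rows, n = tile-diagonals per block-row, S = active SPEs.
 * t_tile, t_dma and t_serial come from the sampling phase. */
double predict_time(double t_tile, double t_dma, double t_serial,
                    int m, int n, int S)
{
    return (t_tile + t_dma) * ((double)m * n + S) + t_serial;
}
```

Because the terms are pluggable, the same function lets you compare tile dimensions or SPE counts offline before running anything on the blade.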
Outline Introduction Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the SPEs Tile Representation Vector Computations Evaluation / Results Conclusion 17
Tile Representation
[Figure: logical representation of a tile vs. the physical representation, in which the elements of each anti-diagonal are stored contiguously] 18
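Assuming the physical layout stores each anti-diagonal of a t x t tile contiguously (my reading of the slide; the actual local-store padding may differ), the index mapping from logical tile coordinates to the physical buffer looks like:

```c
/* Map tile coordinates (i, j) in a t x t tile to an offset in a
 * diagonal-major layout, where each anti-diagonal is stored contiguously
 * so it can be loaded straight into vector registers. */
int diag_major_offset(int t, int i, int j)
{
    int d = i + j;                         /* anti-diagonal index */
    int start = 0;
    for (int k = 0; k < d; k++)            /* cells in earlier diagonals */
        start += (k < t ? k + 1 : 2 * t - 1 - k);
    int first_i = d < t ? 0 : d - t + 1;   /* topmost row on diagonal d */
    return start + (i - first_i);
}
```

Contiguity is the point: with the diagonal stored as one run of memory, the independent cells of a wavefront step sit next to each other for SIMD loads.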
Vector Computations
Goal: Vectorize as much as possible
[Figure: elements along an anti-diagonal are mutually independent and are processed with SIMD vector operations; the remaining boundary elements are handled by serial computations] 19
Outline Introduction Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the SPEs Evaluation / Results Experimental Setup Scalability Charts Performance Model Verification Conclusion 20
Experimental Setup
Compute Platforms:
QS20 dual-Cell blade at Georgia Tech for the parallel implementation
2.8 GHz dual-core Intel processor with 2GB memory for the serial implementation
Example Wavefront Algorithm: Smith-Waterman
A fundamental algorithm in bioinformatics used for homology search; matches nucleotide/protein sequences
8 different sequences chosen from the range 1KB-8KB
Two-phase dynamic programming: matrix filling (wavefront pattern) and backtracing (sequential code) 21
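For reference, the matrix-filling phase follows the classic Smith-Waterman recurrence. Below is a minimal scalar sketch with a linear gap penalty and assumed scores (match +2, mismatch -1, gap -1); the paper's implementation uses an affine gap penalty and is tiled and vectorized:

```c
#include <string.h>

/* Minimal Smith-Waterman matrix filling (linear gap penalty).
 * Each H[i][j] needs only its NW, N and W neighbors, which is exactly
 * the wavefront dependency the talk parallelizes. */
int sw_best_score(const char *a, const char *b)
{
    int m = strlen(a), n = strlen(b), best = 0;
    int H[m + 1][n + 1];
    memset(H, 0, sizeof H);                        /* local alignment: row/col 0 = 0 */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int s = (a[i-1] == b[j-1]) ? 2 : -1;   /* match / mismatch */
            int h = H[i-1][j-1] + s;               /* NW */
            if (H[i-1][j] - 1 > h) h = H[i-1][j] - 1;   /* N: gap */
            if (H[i][j-1] - 1 > h) h = H[i][j-1] - 1;   /* W: gap */
            if (h < 0) h = 0;                      /* never below zero */
            H[i][j] = h;
            if (h > best) best = h;
        }
    return best;
}
```

Backtracing (the second, sequential phase) would start from the cell holding `best` and walk back toward a zero cell, which is why the full O(mn) matrix is kept in memory.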
Scalability Chart
Sequence size: 8KB; wavefront matrix size: 8000x8000 integers
Similar results for all other input sequence sizes (1KB-8KB)
Why are the sequence sizes <= 8KB? Larger matrices overflow the 1GB XDRAM
Why are the tile dimensions <= 64x64? Larger tiles overflow the 256KB local store
[Chart: speedup and efficiency vs. number of SPEs (1-16) for tile dimensions 8x8, 16x16, 32x32 and 64x64]
Near-constant efficiency irrespective of the tile dimensions 22
Performance Model Verification
Sequence size: 8KB; wavefront matrix size: 8000x8000 integers; tile dimension: 64x64 integers
Similar results for all other input configs: sequence sizes (1KB-8KB); tile dimensions 32x32, 16x16 and 8x8 integers
[Chart: measured vs. predicted execution time (normalized) vs. number of SPEs (1-16)]
Mean error rate = 3%; Max. error rate = 10% 23
Performance Model
Why do we need the performance model?
Predict the execution time offline, based on pluggable input parameters
Evaluate the tradeoffs between different input configurations before actually deploying the application
See the "Sequence Throughput" section in the paper
[Chart repeated from the previous slide: measured vs. predicted execution time (normalized) vs. number of SPEs; mean error rate = 3%, max. error rate = 10%] 24
Outline Introduction Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine Optimizations for the SPEs Evaluation / Results Conclusion and Future Work 25
Conclusion Efficiently mapped wavefront algorithms on the Cell Broadband Engine Developed a highly scalable design that streams tiles across the SPEs Unique data layout scheme to maximize the vector processing capabilities of the SPEs Accurate prediction model of the execution time based on a number of pluggable parameters 26
Future Work Validation of the tiled-wavefront approach for other wavefront algorithms and also other emergent CMP architectures (e.g., GPU) Integrate the parallelized Smith-Waterman code into sequence search toolkits Extend the design to a cluster of Cell-based nodes For more information CS @ VT: www.cs.vt.edu The SyNeRGy Lab: synergy.cs.vt.edu Center for High-End Computing Systems (CHECS): www.checs.eng.vt.edu Contacts: Ashwin Aji: aaji@cs.vt.edu Wu Feng: feng@cs.vt.edu 27
Related Work: Smith-Waterman on the Cell
IBM's approach: coarse-grained parallelization; one sequence pair per SPE; max seq. size = 2048; O(m) space; no backtrace?; which gap penalty, linear or affine?
Our approach: fine-grained parallelization; one sequence pair across all available SPEs; max seq. size ~= 8200; O(mn) space; includes backtrace; affine gap penalty
More realistic scenario but requires more memory 28